
Multi-core scaling: it’s not multi-threaded - lukes386
http://erratasec.blogspot.com/2013/02/multi-core-scaling-its-not-multi.html
======
nkurz
He suggests an interesting approach.

1) Tell the kernel it only has a limited set of cores to work with.

 _The way to fix Snort’s jitter issues is to change the Linux boot parameters.
For example, set “maxcpus=2”. This will cause Linux to use only the first two
CPUs of the system. Sure, it knows other CPU cores exist, it just will never
by default schedule a thread to run on them._

2) Manually schedule your high priority process onto a reserved core.

 _Then what you do in your code is call the “pthread_setaffinity_np()”
function call to put your thread on one of the inactive CPUs (there is Snort
configuration option to do this per process). As long as you manually put only
one thread per CPU, it will NEVER be interrupted by the Linux kernel._

3) Turn off interrupts to keep things as real time as possible.

 _You can still get hardware interrupts, though. Interrupt handlers are really
short, so probably won’t exceed your jitter budget, but if they do, you can
tweak that as well. Go into “/proc/irq/smp_affinity” and turn off the
interrupts in your Snort processing threads._

4) Profit?

 _At this point, I’m a little hazy at what precisely happens. What I think
will happen is that your thread won’t be interrupted, not even for a clock
cycle._

Can anyone remove the haziness? I'm more interested in this for benchmarking
than performance, and wonder how it compares to other ways of increasing
priority like "chrt". Is booting with a low "maxcpus" necessary, or can the
same be done at runtime?
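
For concreteness, here's a minimal sketch of what step 2 looks like in code.
This is my own untested sketch of the idea, not code from the article; it
assumes Linux/glibc and compiling with -pthread:

    /* Pin the calling thread to a "reserved" core, e.g. core 3 on a box
     * booted with maxcpus=2. pthread_setaffinity_np() returns an errno
     * value rather than setting errno. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int err = pin_to_core(3);
        if (err != 0)
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
        /* ... run the latency-sensitive loop here ... */
        return 0;
    }

Presumably the same mask can also be set from outside the process with the
taskset utility, which is part of why I wonder whether the maxcpus boot
parameter is strictly necessary.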

~~~
jamesaguilar
I think he has some iffy premises. I'm not certain, but I guess pthread
mutexes probably use userspace mutexes under the covers. If they don't, you
could select a library that does. Grabbing uncontended mutexes should be
basically free. It is definitely not a good idea to write all your own
abstractions, as he suggests, unless you have a really great understanding of
the issues involved. You'd probably still get it wrong sometimes even with
that.

Affinity is definitely valuable, but I don't think you should need to disable
interrupts for most applications. I'm not even sure if it is generally
possible, since interrupts are sometimes used to swap pages or notify the
kernel that a page isn't mapped or that a blocked resource is now available.
The reason affinity is valuable is not because of kernel interactions. It's
because of NUMA and cpu cache swapping. Affinity can prevent thread migration,
which is expensive mainly because data also has to be migrated or else
accessed in a less efficient manner. Likewise, make sure that if you dispatch
an asynchronous call, the handler runs on the same core you sent the call
from.

Finally, it's a common fallacy in these kinds of posts to act as if threads
can't be used to do shared-little or shared-nothing-style multi-programming.
They often aren't, but there's nothing that prevents it.

~~~
javert
_I'm not certain, but I guess pthread mutexes probably use userspace mutexes
under the covers._

Correct. man 7 pthreads states:

 _In NPTL, thread synchronization primitives (mutexes, thread joining, and so
on) are implemented using the Linux futex(2) system call._

A futex is a mutex that only does a system call if there is contention; you
can google it.
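
Roughly, the fast path is a single atomic compare-and-swap in userspace, and
the kernel only gets involved when that fails. A toy sketch of the idea (mine,
not from the man page; it omits the "are there waiters?" bookkeeping a real
implementation uses to skip the wake syscall on an uncontended unlock):

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    static void lock(atomic_int *f)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(f, &expected, 1))
            return;                          /* uncontended: no syscall */
        do {                                 /* contended: sleep in kernel */
            syscall(SYS_futex, f, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        } while (!atomic_compare_exchange_strong(f, &expected, 1));
    }

    static void unlock(atomic_int *f)
    {
        atomic_store(f, 0);
        syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
    }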

~~~
robertgraham
My blog post was quite clear that I'm talking about futexes, and that when it
fails to get a lock it goes into the kernel. Seriously, that's how
everyone did it from Solaris to Windows before Linux "invented" the concept
AND it's been in Linux for a decade. When somebody says "mutex", you have to
assume they already mean "futex".

~~~
javert
If you fail to get the lock, then you have to wait on the lock, so you might
as well go into the kernel anyway. That's the reason futexes do that, instead
of just letting the thread spin.

True, there are scenarios when you might prefer to spin, but they're pretty
specialized.

 _My blog post was quite clear that I'm talking about futexes_

No, you never mention the word "futex."

 _When somebody says "mutex", you have to assume they already mean "futex"._

I don't think that's a valid generalization at all.

~~~
robertgraham
The entire point of the post was to talk about "lock-free" algorithms where two
cores can make forward progress without either having to wait or spin.

What systems have mutexes that aren't built like futexes?

~~~
javert
Kernels, and probably any kind of parallel user-level runtime like Erlang.

------
6ren
Hypothesis: we will never solve multi-core for general purpose computing
(there's also <http://en.wikipedia.org/wiki/Amdahl%27s_law>). But we can do
multi-core for the embarrassingly parallel - such as graphics (top GPUs
have over 1000 cores), so instead of solving this problem, our focus will
shift to those tasks for which multi-core _does_ work - because it's only
these that keep improving at Moore's Law-like rates.

Arguably, this is already happening.

~~~
joe_the_user
I am not sure what would theoretically prevent each web page and each
application from running on a separate core (or two). Considering the number
of people who love multiple tabs, I can't see how that wouldn't seem like a
win for even thirty or more cores.

~~~
mistercow
The problem is that most web pages and programs don't actually do very much
when they're in the background, and much (most?) of the software that _does_
grind away for long periods of time without user interaction (compilers, ray
tracers, video encoding, etc.) is already able to make good use of many cores.
But users who fit that profile are a niche market.

What the vast majority of users need in terms of speed is _acute_ performance.
A dedicated core for every webpage is not very useful because users don't load
up a bunch of tabs at the same time and then flip between them many times per
second, interacting with each one. Instead CPU usage tends to happen in short
bursts directly following user interaction, so even if every tab has its own
core, you won't usually see very much contention for CPU between different
tabs. The same goes for most native programs. Users just don't multitask fast
enough to make separate cores relevant for most programs.

~~~
joe_the_user
I think the problem is that hardcore parallel programmers can't understand the
needs of ADHD web browsers.

Some fraction of web pages are all sorts of ridiculous JavaScript and
seriously slow down my machine.

And I, personally, do flip pages at a significant rate.

~~~
mistercow
>Some fraction of web pages are all sorts of ridiculous JavaScript and
seriously slow down my machine.

Yes there are exceptions, but not enough to plausibly make use of more than
eight cores. The returns are usually diminishing at even four cores.

>And I, personally, do flip pages at a significant rate.

More than two or three times per second? Because that's what it would take.

------
speeder
I wonder if we will ever figure out a way to resume improving clock speeds
instead of adding more parallelism.

Parallelism has two major issues:

First, not all applications need it. In many cases you just want to do a
series of operations on a single starting number and nothing else; if you are,
for example, calculating a factorial, and you need only one factorial, it is
useless to make it more parallel.

Second, it is absurdly hard to code for heavily parallelised hardware. Most
coders will write crap code that doesn't work, no matter how good we become at
making helper libraries; it is a totally different way of thinking.

Yes, for some things, like servers, where you can throw a user onto each core,
it is nice... But for many other uses, even simple single-core parallelism,
like SIMD, is not very useful.

~~~
xentronium
> First, not all applications need it. In many cases you just want to do a
> series of operations on a single starting number and nothing else; if you
> are, for example, calculating a factorial, and you need only one factorial,
> it is useless to make it more parallel.

Sorry for nitpicking, but calculating a factorial can certainly be
parallelized. The easiest way is to have each of n cores multiply every n-th
number, then multiply the n partial products together.
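
A toy sketch of what I mean, ignoring that anything past 20! overflows 64 bits
(a real version would use a bignum library like GMP):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    struct job { unsigned long long n, start, stride, product; };

    static void *partial(void *arg)
    {
        struct job *j = arg;
        j->product = 1;
        /* multiply every stride-th number: start, start+stride, ... */
        for (unsigned long long k = j->start; k <= j->n; k += j->stride)
            j->product *= k;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct job jobs[NTHREADS];
        unsigned long long n = 20, result = 1;

        for (int i = 0; i < NTHREADS; i++) {
            jobs[i] = (struct job){ .n = n, .start = i + 1, .stride = NTHREADS };
            pthread_create(&tid[i], NULL, partial, &jobs[i]);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            result *= jobs[i].product;   /* combine the partial products */
        }
        printf("%llu! = %llu\n", n, result);
        return 0;
    }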

~~~
speeder
I doubt that you can get much of an increase in performance as cores increase
unless you are calculating numbers with a huge number of bits.

~~~
tbrownaw
Considering that factorial of 1e6 has about 18e6 bits (and factorial of 1e3
has 8.5e3 bits)? Yes, any factorial that doesn't have a huge amount of bits
will be fast enough to calculate that there's not much point to parallelizing
it.

------
jws
_…from 33-MHz to 3-GHz, a thousand-fold increase…_

There had to be a better way to write that: 33 MHz to 3 GHz is only about a
hundred-fold increase in clock rate; I suppose more work per clock cycle and an
increased number of cores contribute the other 10x of raw performance. But then
the author goes on to say they are stuck, which isn't true of performance, only
of clock rate. In any event, putting an "up is down" error in your sentence
should generally be avoided.

Edit: _The >>>proscribed<<< method for resolving this is a “lock”, where…_
Sigh.

The article covers a lot of ground lightly. It talks about the new Haswell
transactional memory instructions, the way Linux shards network counters, and
a way to make Linux not use a core so you can schedule a process on it that
will never be preempted.

~~~
shawkinaw
I think he just screwed up the math.

------
kyrra
Tangentially related, Snort was doing research to move to a multi-threaded
architecture, but decided against it due to cache synchronization problems
[1]. Though, their thoughts about splitting up processing were quite different
from what the OP blog post suggests.

It looks like Snort gave up on one way of doing multi-threading, but they
could still go the way suggested in the OP's post.

[1]
[http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-re...](http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-released.html)

~~~
robertgraham
Yea, some engineers created a ground-up rewrite of Snort called "Suricata"
that was multi-threaded, and therefore supposedly faster than Snort, which is
only single-threaded. Suricata then failed to show any benchmark where they
exceeded Snort's speed, and they failed to mention that Snort works fine on
multi-core.

It's one of those things "everyone knows multi-threaded is better than single-
threaded", but everyone's wrong.

------
javert
So, this post has a number of errors, and is fundamentally wrong.

(a) pthread_mutex_t and friends use futexes, which only call into the kernel
when there actually is contention.

(b) it would be better to use chrt (change to real-time priority) than the
maxcpus trick, because the former accomplishes the same thing but allows the
core to still be used if the high-priority thread suspends (e.g. to do disk or
network I/O). (A sketch of what chrt boils down to is at the end of this
comment.)

(c) Contrary to his claim about Snort, there is no reason to prefer a
multiprocess design over a multithreaded design for a particular application.
There are no savings in overhead or synchronization or anything like that by
going with processes. In fact, using processes and then using memory mapping
to share data when you could use threads just makes things harder for yourself
for no reason.

(d) _What I’m trying to show you here is that “multi-core” doesn’t
automatically mean “multi-threaded”._ Well, in computer science terminology, a
thread is a schedulable entity, and a process is a schedulable entity with
memory protection. So, he's wrong. Lots of developers talk about threads and
processes as orthogonal things, though, so I can see why he made that claim.

(e) _The overall theme of my talk was to impress upon the audience that in
order to create scalable application, you need to move your code out of the
operating system kernel. You need to code everything yourself instead of
letting the kernel do the heavy lifting for you._ That is horrible advice that
is just going to lead to lots of bugs and wasted effort. It's premature
optimization. Even most people using the Linux realtime preemption patch
(PREEMPT_RT) do not have such strict requirements that they need to take this
advice.

(f) _Your basic design should be one thread per core and lock-free
synchronization that never causes a thread to stop and wait._ Might apply to
certain very specific real-time (as in, embedded systems or HFT) scenarios,
but in general, no, you're just wasting the core when that one thread doesn't
need to use it. Prefer real-time priorities if you really need it.

(g) _Specifically, I’ve tried to drill into you the idea that what people call
“multi-threaded” coding is not the same as “multi-core”. Multi-threaded
techniques, like mutexes, don’t scale on multi-core._ Again, you can only use
multiple cores in parallel if there are multiple threads. And multi-threaded
techniques _do_ scale. You definitely may want to use lock-free
synchronization instead of mutexes in some specialized scenarios, though.

EDIT: OK, here is one other thing I forgot in the list above.

(h) _There are multiple types of locks, like spinlocks, mutexes, critical
sections, semaphores, and so on. Even among these classes there are many
variations._ Technically, mutexes and semaphores both are ways of protecting
critical sections, and a spinlock is a way of implementing a lock (including,
possibly, a mutex or semaphore lock). Again, this is to some degree the
difference between developers with a shared lingo and computer scientists. But
if you go by that kind of lingo, you're missing part of the picture.
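
Re (b): for what it's worth, the chrt approach boils down to a single call,
which you can also make from inside the program. A sketch (assumes Linux and
enough privileges to set a real-time policy):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* SCHED_FIFO: ordinary SCHED_OTHER threads can't preempt us, but
         * the core remains usable whenever this thread blocks or sleeps. */
        struct sched_param p = { .sched_priority = 50 };
        if (sched_setscheduler(0, SCHED_FIFO, &p) != 0)
            perror("sched_setscheduler");
        /* ... latency-sensitive work ... */
        return 0;
    }

From the command line, "chrt -f 50 <command>" does the same thing.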

~~~
robertgraham
(a) What part of "In the Linux pthread_mutex_t, when code stops and waits, it
does a system call to return back to the kernel" do you not understand? That's
how "futex" works: when it fails to get a lock, it must stop and wait, and
therefore does a system call. When it doesn't have to wait because there is no
contention, then it doesn't make a system call.

(b) The entire point is that for really scalable network applications, you
don't want the high-priority thread to suspend for things like disk IO.
Really, you read all that and expected the thread to use blocking IO instead
of asynchronous IO???

(c) The point wasn't that Snort's multi-process design is better, only that
it's acceptable. It's a single-threaded design written in the 1990s that has
lots of quirks that make it hard to convert to multi-threaded operation. The
point is that you can still get it to multi-core without having to make it
multi-threaded.

(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a
multi-core app today and isn't multi-threaded.

(e) I keep repeating the claim because my code written in user mode scales
better than the code of people who disagree. My code scales to 20-million
connections, 20-gbps, 20-million packets/second. What does your code scale to?

(f) "Real-time" is different than the "network" apps I'm talking about. The
entire idea is "control plane" vs. "data plane" separation. Control operations
that need real-time guarantees are very different than high-throughput data
operations.

(g) Multi-threaded techniques that try to interleave multiple threads on a
single core suck for high network throughput. Just ask Apache.

~~~
javert
A lot of what you're saying hinges on being specific to what you call "really
scalable network applications," but you never said anything about that in your
post.

 _(d) How is "multi-core doesn't mean multi-threaded"??? Snort is a multi-core
app today and isn't multi-threaded._

This is, again, between whether you want to be "computer science correct," or
use developer lingo that is not even necessarily consistent among all
developer and OS subcultures. I think saying it the way you are saying it,
instead of "You could use multiple processes to get parallelism, instead of
multiple threads," is misleading.

 _What does your code scale to?_

Well, I'm a real-time guy, so my thread scales down to microsecond overheads.

Did you change the text in your post that explains pthread_mutex_t? If not, I
probably made a mistake to pick on that, because what you have there is pretty
clear.

------
nonamegiven
"Multi-tasking was the problem of making a single core run multiple tasks. The
core would switch quickly from one task to the next to make it appear they
were all running at the same time, even though during any particular
microsecond only one task was running at a time. We now call this “multi-
threading”, where “threads” are lighter weight tasks."

I must have missed something.

Multi-tasking is multiple processes, which mostly have nothing to do with each
other, switched in and out of the processor(s) by the OS, which do not share
in-process memory or context. The programmer does nothing to make this happen,
and normally has little to no say in it.

Multi-threading is a single process, where the threads _carefully_ share
context and memory, and they're all working roughly on the same thing; the
programmer makes this happen explicitly, and usually fucks it up.

<https://en.wikipedia.org/wiki/Multitasking>

<https://en.wikipedia.org/wiki/Multitasking#Multithreading>

------
stefantalpalaru
"Multi-threaded software goes up to about four cores, but past that point, it
fails to get any benefit from additional cores."

Is there any basis for this assertion, or is it just the fact that his system
has only 4 cores?

~~~
montecarl
This article speaks in broad generalities that seem to be based on the
author's particular interests.

There is no reason why multi-threaded software can't scale nearly linearly for
the right problems.

~~~
robertgraham
My point was to talk about the wrong problems.

Networking is actually a "right" problem, and should be nearly embarrassingly
parallel, since two cores can process two unrelated packets at the same time.
But, if you look at network stacks on open-source projects, you see a lot of
fail. That's the point of my post, as a reference to point to why that
spinlock in your networking code is a bad idea, and why it's probably better
to replace it with an atomic operation or a lock-free alternative.
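
To give the flavor of the "shard it" alternative (the way Linux splits its
network counters): rather than one counter every core fights over, keep a
per-thread counter and only sum them when someone asks for the total. A rough
sketch with made-up names, using C11 atomics:

    #include <stdatomic.h>

    #define MAX_THREADS 64

    /* one counter per thread, padded so cores don't false-share a line */
    static struct {
        _Atomic long n;
        char pad[64 - sizeof(long)];
    } packets[MAX_THREADS];

    static void count_packet(int thread_id)
    {
        /* uncontended: each thread only touches its own cache line */
        atomic_fetch_add_explicit(&packets[thread_id].n, 1,
                                  memory_order_relaxed);
    }

    static long total_packets(void)
    {
        long sum = 0;
        for (int i = 0; i < MAX_THREADS; i++)
            sum += atomic_load_explicit(&packets[i].n, memory_order_relaxed);
        return sum;
    }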

------
javert
_There are two fundamental ways of doing this: pipelining and worker-threads.
In the pipeline model, each thread does a different task, then hands off the
task to the next thread in the pipeline._

Why not just implement the pipeline entirely in one thread, and then replicate
them (just like worker threads)?

What will happen is that the first worker thread will be executing stage 2,
while the second worker thread is executing stage 1. The OS will automatically
schedule them on different cores.

Am I missing something?

~~~
robertgraham
When there is high contention for a resource, it's better that one thread do
it and access it contention-free, rather than make multiple threads contend
for it.

Even so-called "lock-free" synchronization has locks; they are just very short
(30 clock cycles). Therefore, you still want to avoid contention if you can
figure out a way to do it.

I didn't really go into enough detail in my example, but pulling packets off
the network is a good example. You can have one thread do it, and therefore
need no contention. Then you can set up multiple single-producer/single-
consumer ring-buffers to forward those packets to worker threads to complete
the processing of the packet. Thus, you essentially get rid of all the
atomic/lock-free contention you would otherwise have.
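
For reference, here's roughly what such a single-producer/single-consumer ring
looks like (a sketch of the shape of the idea, not actual Snort code). With
exactly one reader and one writer, each side only ever writes its own index,
so plain atomic loads and stores are enough:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define RING_SIZE 1024                   /* must be a power of two */

    struct ring {
        void *slot[RING_SIZE];
        _Atomic unsigned head;               /* written only by consumer */
        _Atomic unsigned tail;               /* written only by producer */
    };

    static bool ring_push(struct ring *r, void *pkt)   /* producer side */
    {
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return false;                    /* full */
        r->slot[tail & (RING_SIZE - 1)] = pkt;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    static bool ring_pop(struct ring *r, void **pkt)   /* consumer side */
    {
        unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;                    /* empty */
        *pkt = r->slot[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }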

~~~
javert
 _When there is high contention for a resource, it's better that one thread do
it and access it contention-free, rather than make multiple threads contend for
it._

Right, and that's what I'm suggesting. So, if you have a pipelined
architecture, keep the pipeline inside worker threads, instead of across them,
except when you need to distribute work to the workers (i.e., the first stage,
where you do something like take packets from the network). I think we agree
on all that. I was just curious if there was ever a reason to do it the other
way, i.e., having a separate thread for each stage of the entire pipeline. It
seemed like you were suggesting that was useful in some cases, but perhaps I'm
reading into things too much.

------
miga
I recall similar results on nearly all applications since my MSc studies:
mutexes are bad; pipes and sockets give better scaling. Thread sync primitives
sometimes scale up to 8-12 cores, but multiprocess applications usually end up
much faster. In the age of GCed VMs one also needs to consider the
synchronization cost of GC.

------
abraininavat
Maybe I'm missing something, but I'm not getting the point. It seems to me
there's no fundamental difference between multiprocess with shared-memory
regions for anything that needs to be shared and multithreaded with mostly
thread-local storage plus some shared data. The kernel is going to multiplex
your single-threaded processes among the available cores just the same as it
will multiplex your multiple threads among the available cores.

 _Multi-threaded techniques, like mutexes, don’t scale on multi-core.
Conversely, as Snort demonstrates, you can split a problem across multiple
processes instead of threads, and still have multi-core code._

Synchronization is synchronization. There are inter-process synchronization
primitives, including mutexes. And you can use lock-free synchronization in a
single-process multi-threaded scenario.
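
For example, an ordinary pthread mutex can live in a shared-memory region and
synchronize two separate processes; the only difference from the threaded case
is one attribute. A rough sketch (error handling, cleanup, and the question of
which process initializes the mutex are all glossed over):

    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct shared {
        pthread_mutex_t lock;
        long counter;
    };

    static struct shared *setup(void)
    {
        int fd = shm_open("/demo_shared", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, sizeof(struct shared));
        struct shared *s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* the one line that differs from the single-process case: */
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&s->lock, &attr);
        return s;
    }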

