
Costs of multi-process architectures - cronaldo
http://aseigo.blogspot.com/2014/07/multi-process-architectures-suck.html
======
roeme
My TL;DR: Multi-process architectures may suck on machines where you don't
have enough CPUs/cores for your workload, or where the design of your
software requires far more processes than common sense would dictate.

~~~
easytiger
My constant reluctance about things like goroutines/coroutines is that if
they truly get executed in parallel, there is little to no way to manage
their machine load as one would in a distributed pipeline.

~~~
robotresearcher
The Go runtime allows you to specify how many OS threads it will use to run
goroutines. It schedules goroutines one after the other within that number
of threads. If you specify a maximum of 1, you have no parallelism, just
concurrency.
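
For illustration, a minimal sketch of that knob, assuming the setting meant
is runtime.GOMAXPROCS (strictly, it caps the OS threads running Go code
rather than processes):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        // Cap the scheduler at one OS thread: the goroutines below
        // still interleave (concurrency) but never run in parallel.
        runtime.GOMAXPROCS(1)

        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                fmt.Println("goroutine", id)
            }(i)
        }
        wg.Wait()
    }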

~~~
MichaelGG
Does it have a priority system? Or a way to set up pools? I often find I
want to run some part of a program (some of its tasks) on a set number of
threads at a certain priority, not throw tasks into some global pool.
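
Not a priority system, but a fixed-size pool is easy to hand-roll; a minimal
sketch in Go (goroutines have no priorities, so this only fixes the degree
of parallelism, not the priority):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        tasks := make(chan int)
        var wg sync.WaitGroup

        // A fixed-size pool instead of one global pool: at most
        // `workers` tasks ever run at once.
        const workers = 4
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for t := range tasks {
                    fmt.Println("processing task", t)
                }
            }()
        }

        for i := 0; i < 10; i++ {
            tasks <- i
        }
        close(tasks)
        wg.Wait()
    }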

------
gaelow
Oh come on! How can you say that? Of course there's a cost to multi-process
architecture, but it doesn't suck at all! OK, you shouldn't just spawn new
processes whenever you need more tasks running concurrently, but how could a
modern OS exist without multiprocessing? What would you have us do, go back
to MS-DOS? :-)

~~~
zobzu
If you read his article - it's mainly message passing and the consequent
context switching.

Most programs with protected memory - in the Windows world that's anything
after MS-DOS, incl. Win 95, 98, and ME, for example - don't need much
message passing.

Stuff like web servers does (hence the perf advantage of nginx over Apache,
until they brought in the event MPM). Heck, even web browsers do quite a
bit, despite his point - they do what they can to send as few messages as
possible to compensate.

~~~
gaelow
Yeah, actually I agree with everything in the article but the title. The
article does too :-)

------
flohofwoe
I think the core conflict is synchronization overhead vs. latency, no? If
communication is too granular, there's too much synchronization overhead,
which can be fixed by buffering, which in turn increases latency.

Game engines have been swinging back and forth on this a lot over the past
10 years. First the trend was to have a few "fat threads" (maybe
input+gamelogic->physics->visibility->rendering) which ran in parallel but
were coupled like a pipeline working on the previous frame's data. Each
pipeline stage meant one more frame of latency. Add the rendering API/driver
latency, plus whatever the display device adds, and suddenly games had
something like 100ms of latency or even more, which is very noticeable. Then
people started to make the game loop a simple sequence of subsystem stages
again, but with each subsystem splitting its work on the current frame
internally into small parallel tasks (see the sketch below). It will be
interesting to see what the perfect game engine architecture looks like for
VR, with its ultra-low latency requirements from sensory input to display
update.
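
A hedged sketch of that second approach, in Go (stage names and task counts
are illustrative): stages run sequentially on the current frame, each
fanning out into small parallel tasks and joining before the next stage, so
no extra frame of latency is added.

    package main

    import (
        "fmt"
        "sync"
    )

    // runStage fans a stage's work out into small parallel tasks and
    // joins them all before returning, keeping the stages sequential.
    func runStage(name string, tasks int) {
        var wg sync.WaitGroup
        for t := 0; t < tasks; t++ {
            wg.Add(1)
            go func(t int) {
                defer wg.Done()
                _ = t * t // stand-in for this chunk of the stage's work
            }(t)
        }
        wg.Wait()
        fmt.Println(name, "done")
    }

    func main() {
        for frame := 0; frame < 3; frame++ {
            // All stages work on the *current* frame's data.
            runStage("input+gamelogic", 8)
            runStage("physics", 8)
            runStage("visibility", 8)
            runStage("rendering", 8)
        }
    }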

------
mpyne
I could have sworn that Aaron had mentioned that this story was a
counterpart to his recent "Multi-process architectures rock! :)" post, which
all of you defending multi-process architectures apparently didn't find.
It's worth a read as well, since Aaron already made many of the arguments
you've made here.

------
tonyg
Erlang does quite nicely as a multi-process architecture. Low context switch
overhead, low per-process overhead, built for redundancy and fault-tolerance
(implying multiple execution units), tool support. There's more than one way
to do it - all the world's not Unix.

------
CraigJPerry
These are all problems that have easy solutions - chuck more CPUs at it.

Back in the real world, though, I would say the harder problem is still not
any of these; it's the increased operational complexity. Support and
maintenance costs go up.

I think the problems are somewhat overplayed here too. We've had zero-copy
network stacks for a while. Stack sizes are not a concern, because
scheduling becomes a bigger problem faster than using small chunks of memory
does as you scale.

Edit: forgot to mention isolcpus or pinning as one method to address the
context switching - though frequently we can make a greater impact by
careful consideration of the syscalls we make from our code.

~~~
astrobe_
> These are all problems that have easy solutions - chuck more CPUs at it.

So basically you're telling us that if your application that uses bubble
sort everywhere is too slow, we just have to "chuck more CPUs at it"? No
thanks.

That line of reasoning makes me angry twice over: as an embedded systems
engineer (for whom that kind of "solution" is simply not an option) and as a
user (when I see applications take more and more megabytes and CPU cycles
without proportionally increasing functionality).

~~~
CraigJPerry
This is the way of the world.

Don't get angry about it, though. Instead, make it easier for stakeholders /
customers to choose to do "the right thing". Be transparent and accurate
with time-to-fix estimates. That's a hard problem; I don't think it's even
perfectly solvable, but I try hard anyway.

The uncomfortable truth: sometimes "the right thing" is to throw more CPU at
it. Being late to the consumer electronics market, for example, even if your
product is bug-free and perfectly designed, means you lost a whole chunk of
available cash to the guy who got there first... With that money, he may be
able to make a v2 that completely trumps yours.

------
_h__
I can't agree with these points. It all depends on the use case. In some
cases it is even beneficial to remove the OS itself. I am not talking about
high-level OSes like Linux, QNX, or VxWorks; even MicroC/OS-II is a big
overhead for some systems.

Consider how your car decides to deploy the airbags. Do you want a message
queue? No: as soon as the hardware inputs meet the condition, the airbags
need to be deployed. On the other hand, the same car will have an
infotainment system running VxWorks/QNX/WinCE with a multi-process
architecture. Most of them even have a separate processor to interface with
the vehicle CAN bus and handle power management. Inside the application
processor, graphics and the HMI will be distributed across some processes,
and the low-level drivers and codecs across another set of processes. The
whole thing just gives the user a media player, a map, and a phone
interface. Some OEMs (e.g. Daimler, Ford) even distribute this functionality
across different hardware modules.

Dividing gives you maintainability, reusability, and drop-in replacement
alternatives. In most cases it shortens engineering time, improves quality,
and reduces product recalls.

The above automotive embedded example is just one use case. There are many
areas where you want to distribute your application in many ways.

Finally, I want to ask one small question: when you turn off the reading
light in a passenger aircraft, how many processes do you want the switch-off
signal to go through before the light turns off? And why?

------
strstr
I'm guessing the study from 2007 is a bit stale now. Intel/AMD/... have almost
certainly been trying to decrease the penalty for context switches. I'm
curious how much they've changed over time.

~~~
sliverstorm
I'm not an OS guy, but most aspects of context switching are implemented in
software. It seems to me the only places where hardware could help would be
faster memory access (which is more of a general optimization!) and the TLB.
Awkwardly, though, x86 doesn't seem to have a TLB-insert instruction, so you
just have to take the miss...?

Oh, and you could shorten the pipeline to make flushes faster and the
penalty smaller. But pipeline length has stayed fairly static.

~~~
rockdoe
It's interesting that x86 only seems to have TLB tagging for VM guest/host
switching. Regular context switches require TLB flushing, which (reading the
Linux kernel) means flushing either the entire TLB or a specific set of
entries.

I'm surprised no (further) tagging is used for regular switches. Just not
worth it?

Edit: A colleague pointed me to this:
[http://www.google.com/patents/US6510508](http://www.google.com/patents/US6510508)
which is used by recent AMD CPUs. Anyone know of a resource that collects
more info like this in one place?

------
pjmlp
So they just discovered what multicore programming was like back in the day
when the OS could only juggle processes.

Worse, I see no mention of modern micro-kernel OSes, many of them used
commercially, where performance is actually quite good.

------
mwcampbell
I wonder if anyone has recently measured the time it takes to do a context
switch under Linux on a modern x86-64 processor. Something like the numbers in
the table here:

[http://blog.codinghorror.com/the-infinite-space-between-words/](http://blog.codinghorror.com/the-infinite-space-between-words/)
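
Not a proper measurement, but a pipe ping-pong between two OS threads gives
a rough ballpark; a sketch in Go (the figure includes syscall and runtime
wakeup overhead, so treat it as an upper bound, not a clean context-switch
cost):

    package main

    import (
        "fmt"
        "os"
        "runtime"
        "time"
    )

    func main() {
        const n = 100000
        r1, w1, _ := os.Pipe()
        r2, w2, _ := os.Pipe()

        go func() {
            runtime.LockOSThread() // force the echo side onto its own OS thread
            b := make([]byte, 1)
            for i := 0; i < n; i++ {
                r1.Read(b)
                w2.Write(b)
            }
        }()

        runtime.LockOSThread()
        b := make([]byte, 1)
        start := time.Now()
        for i := 0; i < n; i++ {
            w1.Write(b)
            r2.Read(b)
        }
        // Each round trip involves two thread wakeups, so halve this
        // for a rough per-switch figure.
        fmt.Println("per round trip:", time.Since(start)/n)
    }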

------
signa11
Why not just pin processes to cores, e.g. sched_setaffinity and friends on
Linux? That works, right?
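
For what it's worth, a minimal sketch of doing that from Go via
golang.org/x/sys/unix (Linux-only; pid 0 means the calling thread, and
pinning to core 2 is just an illustrative choice):

    package main

    import (
        "fmt"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Bind this goroutine to one OS thread so the affinity sticks.
        runtime.LockOSThread()

        var set unix.CPUSet
        set.Zero()
        set.Set(2) // allow this thread to run only on core 2

        if err := unix.SchedSetaffinity(0, &set); err != nil {
            fmt.Println("sched_setaffinity:", err)
            return
        }
        fmt.Println("pinned to core 2")
    }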

~~~
corysama
Apparently that doesn't work on Android. At GDC this year I was in a hallway
huddle of high-end mobile game devs complaining about how wide parallelism
doesn't work on Android, because the scheduler stops listening to your
affinity requests and just schedules everything serially on one core so that
it can save power on the other three.

------
PeterGriffin
I don't know why the author called this "Multi-process architectures suck
:(" when he really meant "I suck at multi-process architectures :(". Look at
what he lists:

\- "Context Switching"

You can find 4 cores in a trivial computer these days (8 hardware threads
with hyperthreading). This means you can have 8 processes without any
context switching at all; it also suggests that if you don't use multiple
processes, you can only reach about 10-20% of the capacity of a multi-core
machine.

\- "Per-process overhead"

It's true that it has overhead, but you don't have to create _thousands of
processes_; we have lightweight concurrency patterns to use within each
actual thread/process.

But even then, you don't have to run more than one primary process per core.
You _have multiple cores_; they can't all be kept busy by _one_ process.

\- "We built for a single-core world" and "We lack the tools"

Those are outright PEBKAC errors, not architectural problems. We're way past
the stage when multi-process architectures were a hardly understood,
confusing problem. We have the right tools, for those who look for them.

Cache misses and overly eager context switching _are_ bad, and you can read
a lot about this from Martin Thompson on his Mechanical Sympathy blog:
[http://mechanical-sympathy.blogspot.com/](http://mechanical-sympathy.blogspot.com/)
There are ways to design multi-process systems to take best advantage of all
your cores and your machine's caches.

But even Thompson's Disruptor architecture is based around message passing
between multiple execution units. Because, again, in a multi-core world,
suggesting anything else is laughable.

Plus, forget computers: multitasking is a fact of life. Animals do it,
humans do it, so do computers. We sometimes have to check email and talk on
the phone at the same time. We need to walk and chew gum. We constantly
argue and write about how much we should multitask versus how much we should
focus on a single task, because people have context switching overhead too.
Well, for both people and computers there's a balance, and either extreme is
counter-productive. It's as simple as that.

~~~
Karellen
"You can find 4 cores in a trivial computer these days (8 cores with
hyperthreading)."

There are still plenty of tablets and phones out there which don't. (And the
author _is_ targeting those devices with the software he's writing.)

"It's true it has overhead, but you don't have to create thousands of
processes,"

Ah, but that might be a natural consequence of your design. I think part of
the problem is that we're currently in the middle of a really awkward
transition period in computer architectures.

As you point out, almost no computers are single core nowadays. So having a
single-process application is obviously not taking advantage of all the
computer power you have at your disposal. It's blatantly sub-optimal.

However, if you're going to design a multi-process application framework,
you probably want to design it to kick off new processes whenever you're
about to do something non-trivial that could introduce significant latency.
But depending on what the user/application is actually doing, that _might_
end up starting dozens, or hundreds, or even thousands of processes.

And our computer architecture is not yet at the point where we have "enough"
cores in our CPUs that we can just do that and have it work. We're getting
there, but it's likely a decade or maybe two away.[0]

So, we want to create a multi-process architecture to stop wasting the
computing power that exists; but we have to be careful and write extra code
to manage creating these processes, because we can't yet afford to create
them "on a whim".

It seems similar in some ways to segmented 16-bit memory models. In an early
flat 16-bit memory model, you were very constrained (64k) but at least the
environment you were working in was simple. The move to a segmented 16-bit
model was in some ways a lot less constrained, but taking advantage of it
meant dealing with a bunch of extra complexity. (boo!) It was the next step to
flat 32-bit systems which made working with memory painless and simple again,
while further lessening the constraints.

When low-end phones and tablets have >=256 cores, then we'll be able to take
advantage of multi-process frameworks properly.

Having seen Alan Kay's talk a few weeks ago, I was fascinated by the idea of
"spending money to get ahead of Moore's law" to put together a computer that
would let you write the sort of software that takes advantage of the
machines of a decade from now, but I couldn't figure out what that would
mean if you wanted to do it today. It occurs to me that putting together a
>=256 core system might be a good start.

~~~
astrobe_
I doubt tablets will ever need 256 cores. Desktop computers don't need that
many either, if you notice that many of them just run a browser and Excel.
It's really only when you do specific stuff, gaming for instance, that you
really need more horsepower. 2-4 cores will probably be enough until at
least the end of the decade: that lets most people run 2 or 3 programs
smoothly. A corollary is that a program trying to use every core may not be
such a good idea in the big picture.

Furthermore, if one observes the hardware evolution of the PC, one notices
that it took the direction of a heterogeneous multicore architecture (CPU,
GPU, etc.) rather than a homogeneous one: there are more "cores" of
different types in your PC outside your i7 than inside it. The same goes for
tablets and phones, which typically feature an ARM design with CPU, GPU, and
DSP integrated into one chip for footprint and power consumption reasons.
Architectures seem to be evolving into modular hardware, featuring a base
CPU backbone to which specialized chips are added. There are a gazillion
ARM-based designs, depending on which set of functions you need.

This makes sense for the software and consumer electronics industries.
Switching to completely different solutions like many-core chips (the Mill
CPU, GreenArrays) would have a huge cost. Picking a common, general-purpose
"few-cores" CPU and adding in specialized chips as needed is much more
affordable.

~~~
Karellen
"I doubt tablets will ever need 256 cores. Desktop computers don't need that
much either"

And 640k will be enough for everyone.

Hey, you might even be right about not needing them, but that doesn't mean
those devices won't get them. You don't really need a 32-bit CPU to run your
microwave or washing machine, but very few of the ones you can buy today are
running 8-bit microcontrollers with hand-coded assembler.

Though clock speeds stopped increasing some time ago, Moore's law marches
on, and we're still getting more and more transistors per buck. Sure, some
of them will go to more L1/L2 cache, but at some point the bandwidth to
flush it to main memory becomes a bottleneck, so I think we're going to see
more and more cores per chip. As that marches on, I think it's going to
trickle down to even the low end of CPUs. For instance, you might be able to
buy 64-core chips with a bunch of the cores disabled really cheap, because
those cores failed factory testing and were switched off. Such a chip can't
be sold for full price any more, but that doesn't mean it's useless.

And once your $5 CPUs have 32 cores, well, providing the tooling is there, you
might as well use a framework that makes use of them.

"many of them just run a browser and Excel."

Bad choice of examples there, I think. Browsers can be pretty multi-process
heavy these days, with one main process, plus one per tab, plus extra
processes for e.g. decoding streaming video in a tab, or running your
(_spit_) EME plugins (_spit_) in a sandbox.

Similarly with Excel: sure, most of the time it doesn't do much. But if
you've got a spreadsheet with a bunch of dependent cells/formulas, then when
someone updates the right cell, being massively parallel could really speed
up value recalculation and propagation throughout the sheet. Some
spreadsheets translate really well to map/reduce, and using all the cores
could really help there (see the sketch below).
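
As a toy illustration of the map/reduce point (assuming the cells are
independent and already computed): split a big column across NumCPU workers,
reduce each chunk to a partial sum, then combine them.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        // A million "cells"; pretend each holds a computed value.
        cells := make([]float64, 1_000_000)
        for i := range cells {
            cells[i] = float64(i)
        }

        workers := runtime.NumCPU()
        partial := make([]float64, workers) // one partial sum per worker
        chunk := (len(cells) + workers - 1) / workers

        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func(w int) {
                defer wg.Done()
                lo, hi := w*chunk, (w+1)*chunk
                if lo > len(cells) {
                    lo = len(cells)
                }
                if hi > len(cells) {
                    hi = len(cells)
                }
                for _, v := range cells[lo:hi] {
                    partial[w] += v
                }
            }(w)
        }
        wg.Wait()

        total := 0.0
        for _, p := range partial {
            total += p
        }
        fmt.Println("sum:", total)
    }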

------
cryptophile
Enlisting more cores in order to get something done faster by splitting up
the work is indeed a serious pain.

I recently wanted to get lpsolve to split an integer branch-and-bound
programming problem across multiple cores, and then throw a large on-demand
AWS instance at it.

The branch-and-bound algorithm is eminently parallelizable, so it should
have been possible.

I came to the conclusion, however, that I would have to rewrite lpsolve for
that. The program sticks to one process, and there is no way to get it to
fork other processes and read back the results (a sketch of that missing
fan-out follows below).
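
For what that missing fan-out could look like if you drove the solver as
subprocesses instead - a hedged sketch in Go (the lp_solve command-line
invocation and the .lp file names are illustrative, not an API lpsolve
actually offers for this):

    package main

    import (
        "fmt"
        "os/exec"
        "sync"
    )

    func main() {
        // Hypothetical branch-and-bound subproblems, one model file each.
        subproblems := []string{"sub1.lp", "sub2.lp", "sub3.lp"}
        results := make([]string, len(subproblems))

        var wg sync.WaitGroup
        for i, lp := range subproblems {
            wg.Add(1)
            go func(i int, lp string) {
                defer wg.Done()
                // One solver process per subproblem; collect its stdout.
                out, err := exec.Command("lp_solve", lp).Output()
                if err != nil {
                    results[i] = "error: " + err.Error()
                    return
                }
                results[i] = string(out)
            }(i, lp)
        }
        wg.Wait()

        for i, r := range results {
            fmt.Printf("subproblem %d:\n%s", i, r)
        }
    }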

~~~
skj
The tendency is for things like lpsolve to be written single-process,
because typically when you need to solve something once, you need to solve
it a thousand times, and then the natural distribution is to give each
available core its own lpsolve instance.

A thousand lpsolve invocations each running on a single core will finish
sooner than the same thousand invocations each trying to spread across 10
cores.

------
kazinator
I get nothing when viewing the site, just a blank page with a list of URLs
generated by NoScript. Not a shred of content to be seen without allowing
JavaScript. It has the trappings of a malicious page.

~~~
kazinator
Sorry about the unconstructive comment, folks. Does anyone have a URL that
just serves up HTML with the text of the article? Thanks.

