
Think Before You Parallelize: A tour of parallel abstractions - adamnemecek
https://jackmott.github.io/programming/2016/08/30/think-before-you-parallelize.html
======
vvanders
Just a small nit:

> Another scenario: say you are working on a 3D game, and you have some tricky
> physics math where you need to crunch numbers, maybe adding realistic
> building physics to Minecraft. But separate threads are already handling
> procedural generation of new chunks, rendering, networking, and player
> input. If these are keeping most of the system’s cores busy, then
> parallelizing your physics code isn’t going to help overall.

Generally game engines have been migrating towards work-stealing task
architectures. Monolithic thread-based systems (one thread each for physics,
rendering, gameplay, etc.) were great for migrating from the single-threaded
games of old, but they quite often lead to idle threads.

This was even more critical in the PS3 era, where the SPUs had just ~256KB of
local memory. Overall it leads to an architecture that scales well to whatever
platform you end up targeting, since the CPU/compute capabilities of various
platforms can be pretty disparate.

~~~
stcredzero
_Generally game engines have been migrating towards work-stealing task
architectures._

How are Golang goroutines for implementing work-stealing?

~~~
vvanders
From what I understand of goroutines, the runtime already handles distributing
them across OS threads, so there's no work-stealing to implement yourself.

For a game engine where a task-based structure is important, Go would be a
pretty poor choice. You don't want a GC in your inner engine loop. In addition,
Go has pretty poor semantics for explicit memory layout, something that can be
critical for performance. It's very common to issue cache prefetches for the
next task as the current one is spinning down, for instance.

~~~
stcredzero
_For a game engine where task based structure is important Go would be a
pretty poor choice._

I have a game server written in Go.

 _You don't want a GC in your inner engine loop._

Why not? I expect my typical GC pause times to be 1 or 2 ms under Go 1.7. But
if my server had pause times of 20ms, I wouldn't care, so long as it didn't
happen more often than once a second or so.

 _It's very common to issue cache prefetches for the next task as one is
spinning down for instance._

Yeah, I've had Go guys say I should give every agent its own goroutine. As it
turns out, this can cost 150-300 microseconds every time one of them wakes up.
I have a traditional game loop right now.
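A sketch of that traditional loop: one goroutine ticks at a fixed rate and updates every agent in sequence, rather than waking a goroutine per agent (the `Agent` type here is a made-up placeholder):

```go
package main

import (
	"fmt"
	"time"
)

// Agent is a hypothetical game entity; only the loop structure matters.
type Agent struct{ x, vx float64 }

func (a *Agent) Update(dt float64) { a.x += a.vx * dt }

func main() {
	agents := make([]Agent, 1000)
	for i := range agents {
		agents[i].vx = 1.0
	}

	// Traditional single game loop: one tick updates every agent in
	// sequence, instead of waking 1000 goroutines (each wakeup costing
	// on the order of 100+ microseconds of latency).
	const dt = 1.0 / 60.0
	tick := time.NewTicker(time.Second / 60)
	defer tick.Stop()
	for frame := 0; frame < 3; frame++ {
		<-tick.C
		for i := range agents {
			agents[i].Update(dt)
		}
	}
	fmt.Printf("%.2f\n", agents[0].x) // 0.05 (3 frames of 1/60s each)
}
```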

~~~
vvanders
> I have a game server written in Go.

Game engine != game server.

> Why not? I expect my typical GC pause times to be 1 or 2 ms under Go 1.7.
> But if my server had pause times of 20ms, I wouldn't care, so long as it
> didn't happen more often than once a second or so.

We're talking about client-side latency here. You've got 16.6ms to do
everything in a game engine. Even 2ms is 1/8th of your frame and a significant
amount of work. 20ms is two missed frames which is _really_ bad.

The types of engines where work-stealing and task-based architectures are
prevalent are usually of the AAA variety, where performance is critical and
the scenes are either very complex or very large. In this space, GC-based
languages have been limited in scope to gameplay scripting, and even then are
usually very fast (Lua or equivalent).

~~~
MaulingMonkey
> Even 2ms is 1/8th of your frame and a significant amount of work.

And this is targeting a mere 60Hz. Oculus DK2s overclocked their LCD panels to
run higher than spec (75Hz), and CV1 shipped an even higher refresh rate
(90Hz), because of its importance in reducing nausea. To say nothing of the
enthusiasts running 120Hz or 144Hz displays - in which case you've blown
through more than 1/4th of your frame budget 'at random' (read: you must
assume it could happen in any frame) unless you can control when GC occurs.

> 20ms is two missed frames which is really bad.

If you're doing VR, this is bad enough that some of your customers may hurl.
I've gone after fixing single frame hitches which were caused by as little as
an extra 1ms spike at exactly the wrong time in _non_ VR games. Unfixable 20ms
spikes would be a total non-starter.

There are a lot of applications where you can tolerate an extra 1-2ms spike. A
game server, where network latency is an order of magnitude or two larger,
will almost always qualify. And if the improved productivity from using the
language lets you optimize away an additional 1-2ms cost elsewhere that you
wouldn't have time for otherwise, it can even be worth it.

For me, there's so much stuff I use to boost my productivity that's missing
from Go - by design, no less - that my productivity would be going the wrong
way.

------
wyldfire
> Don’t Parallelize when it is already Parallelized

For the uninitiated -- there's probably a large list of "don'ts" that would be
good here, too. But maybe a prerequisite is ideal: why are you trying to write
[this code] to execute in parallel? "It's slow" isn't good enough. You can
save yourself a lot of time and headache if you use profilers to understand
what the system is doing that makes it slow.

Using that profiler will at least help partition the problem into one of two
big domains: either you're bound on compute resources (CPU/bus/memory/cache,
etc.), or you're just pending results from some other async task in the
system. The latter happens much more often than you might expect. Filesystems,
locks/critical sections, networks, databases, IPC -- there's lots of stuff for
which running multiple tasks in parallel might not help much.

~~~
stcredzero
There were times/places when/where programmers in the boonies of the US had
funny ideas about pre-optimization and dynamic languages. (Specifically,
Ohio!) Basically meetup hecklers who had the weird idea that you should write
super-hacky-optimized code 100% of the time, or it would never be fast enough.
So then I asked the room to raise their hand if they had ever used a profiler
on a real production system. Then I asked them to leave their hand up if
they'd never ever been surprised by what the profiler told them!

~~~
jackmott
So the lesson is.... Write super hacky optimized code 100% of the time, but
constantly profile it too!

~~~
stcredzero
The lesson is: If you have a TARDIS and you find yourself in Ohio in the late
90's, don't bother going to a programmer's meetup.

~~~
jsmith0295
Now we have the opposite problem, at least in my experience. Everyone just
says "premature optimization is the root of all evil" and then uses that as an
excuse not to have a sane architecture just to save a little effort up front.
After all, that would slow down the rate at which they could complete user
stories for the week...

~~~
stcredzero
Yet another swing of the pendulum.

------
paulddraper
Good article. Disagree with this:

> What you forgot was that the web server was already parallelizing things at
> a higher level, using all 24 production cores to handle multiple requests
> simultaneously.

The reason you didn't see improved performance is that your server's load is
high; it's maxed out on the work it can do. Obviously no amount of parallelism
could fix that.

Double the servers processing requests, and then your nifty in-memory
parallelism can make the difference you intended. Note: had you doubled the
servers _without_ adding your extra parallelism, the extra hardware would not
necessarily make a difference.

You did a good thing; you were just dealing with overloaded hardware.

------
weinzierl
I'm positively surprised that Rust does so well. Not only are the performance
results good, I also like the brevity and elegance of the code. The only
downside is that the examples don't work out of the box, but require an
external library.

~~~
chrismorgan
… an external library that is _really_ easy to add. One line to Cargo.toml
(plus another, “[dependencies]”, if you don’t have any dependencies already),
one line to src/lib.rs and one line importing the appropriate trait to any
module you wish to use `.par_iter()` in.

~~~
jackmott
I did put it below Java in the table because it wasn't built in, but yes, the
experience of adding it was very easy. I learned Rust... well, I wouldn't even
say I've learned it yet. Didn't have any trouble.

SIMDArray for F#/C# will be similarly easy when it is a NuGet package. It just
puts extension methods on Array.

------
jbooth
One more: Don't parallelize when you're going to blow L1 caches for no good
reason.

I wrote a gradient descent solver a while back, before I realized Vowpal
Wabbit existed.

Vowpal Wabbit had a dedicated thread to read in records and pass them through
a queue to another thread doing the math. I didn't, because it was a quick POC
impl and I couldn't be arsed. My impl was faster.

This advice probably goes for most usages of auto-parallel collection impls
that the author was pointing out.

~~~
jackmott
That is a good point. If you have around N things going on, and N cores, and
each thing can saturate one core pretty well, it is better to keep them each
on one core so they can rip through their own L1 cache rather than ruin each
other's.

The loop abstractions don't give you the control to ensure that, though I
would hope the good ones are smart enough to at least keep each thread on one
core and on its own chunk of an array.

That would be interesting to explore.
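A sketch of that chunk-per-thread idea in Go - each goroutine streams through its own contiguous region of the array, which is the cache-friendly layout described above (note Go's standard library offers no core-affinity control, so actual pinning is left to the OS scheduler):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// ParallelSum splits the array into one contiguous chunk per core, so
// each goroutine walks its own region of memory instead of interleaving
// with the others and thrashing their L1 caches.
func ParallelSum(data []float64) float64 {
	workers := runtime.NumCPU()
	if workers > len(data) {
		workers = 1
	}
	partial := make([]float64, workers)
	chunk := (len(data) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		hi := lo + chunk
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			// Accumulate locally, then store once, to avoid false
			// sharing on the partial slice.
			var sum float64
			for _, v := range data[lo:hi] {
				sum += v
			}
			partial[w] = sum
		}(w, lo, hi)
	}
	wg.Wait()

	var total float64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	data := make([]float64, 1_000_000)
	for i := range data {
		data[i] = 1
	}
	fmt.Println(ParallelSum(data)) // 1e+06
}
```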

~~~
chubot
I call this heterogeneous vs. homogeneous threads/processes. Heterogeneous
threads are more icache and dcache friendly. They also make your program more
modular.

Linux has controls to set the affinity of threads for cores.

------
throwanem
I must say this is a remarkably cogent and thorough treatment of parallelism
for performance gains in Javascript especially. Well done!

~~~
jackmott
If I blogged on medium.com I would have had an animated gif from southpark in
that section.

------
sidlls
I would also add a caution to constrain the scope of parallelism in the
implementation to only those bits of code that actually benefit from it, and
to be extra cautious when refactoring an existing serial implementation into a
parallel algorithm.

------
tormeh
You should also think about the context in which the statement will be used.
Is your statement normally the bottleneck? If not, does it matter if it's 3x
slower for light workloads (which are fast anyway) as long as it's 4x faster
for heavy workloads (where the performance difference can be felt)?
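One common way to act on that trade-off is a sequential cutoff: fall back to the serial path for light workloads and only fork when the input is large enough to amortize the overhead. A Go sketch, where the threshold is a made-up number you would tune by measurement:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelThreshold is a hypothetical cutoff; the right value comes
// from benchmarking, not guessing.
const parallelThreshold = 10_000

// Sum is a fork-join sum with a sequential cutoff: small inputs take
// the plain loop, large ones split in half and sum concurrently.
func Sum(data []int) int {
	if len(data) < parallelThreshold {
		// Light workload: serial is faster once you account for
		// goroutine spawn and synchronization overhead.
		total := 0
		for _, v := range data {
			total += v
		}
		return total
	}
	mid := len(data) / 2
	var left int
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = Sum(data[:mid])
	}()
	right := Sum(data[mid:])
	wg.Wait()
	return left + right
}

func main() {
	small := make([]int, 100)
	big := make([]int, 100_000)
	for i := range small {
		small[i] = 1
	}
	for i := range big {
		big[i] = 1
	}
	fmt.Println(Sum(small), Sum(big)) // 100 100000
}
```

This is the same shape libraries like rayon or Java's fork/join pool use internally: the cutoff keeps the light-workload case from paying the parallel tax at all.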

------
habosa
> SIMDArray is ‘cheating’ here as it also does SIMD operations, but I include
> it because I wrote it, so I do what I want. All of these out perform core
> library functions above.

I love the brief interjection in the otherwise objective discussion.

------
merb
> So think about the system your code is running in, if things are getting
> parallelized at a higher level, it may not do any good to do it again at a
> lower level. Instead, focus on algorithms that run as efficiently as
> possible on a single core.

Generally this is extremely bad advice. Especially with:

> What you forgot was that the web server was already parallelizing things at
> a higher level

Apache parallelized like that, and it was bad.

Nginx then introduced an event loop, which was way better. However, your
application still needs to deal with it; i.e., if one request blocks (taking
too much time), your nginx will block too. If that happens on all worker
threads, your website will be extremely slow.

An event-loop mechanism always means that your stuff should be non-blocking,
and that is hard. This is especially hard when dealing with external
resources, i.e. databases, external services, caches, file IO. When you just
say "well, my webserver will handle that" you will be in really, really bad
shape.

~~~
jackmott
> Nginx then introduced an event loop, which was way better.

> An event-loop mechanism always means that your stuff should be non-blocking
> and that is hard.

hmm ;)

So you are saying you need to think about the system your code is running in?

Even with an event-loop server, if CPU utilization is sufficiently high,
parallelizing a function can be counterproductive due to overhead and cache
thrashing.

~~~
merb
> if CPU utilization is sufficiently high

That won't be the case in an event-loop system; it could be the case if you
just have too little hardware. But then even your non-parallelized code will
fail.

------
avodonosov
It seems to be the same single abstraction, just in several languages

------
bluejekyll
This is totally awesome. And I'm always blown away by how performant Java is.
For some reason people give it a bad rap, but it's still awesome.

Though, one thing I wish was included is memory utilization. I would expect
Rust and C++ to be significantly better than Java in this realm (due to the
overhead of the JVM).

~~~
steveklabnik

> For some reason people give it a bad rap,

A lot of people have outdated opinions about Java; I know I did for a long
time. Initially, the performance wasn't there, so people held that opinion
without updating it when Java updated. Things are very different today.

~~~
stevoski
> Things are very different today.

Java has been fast for at least ten years.

Except start-up time. Java still has way too long a startup time, which you
notice in desktop apps implemented in Java.

~~~
zigzigzag
HelloWorld.java runs in about 50msec on my machine.

However, there are lots of crap desktop apps written in Java that don't care
about startup time. You can write apps with bad startup time in any language:
look at OpenOffice (C++).

------
chadscira
Anyone trying to do this in javascript (universal) could check out Task.js [0]
for a simple interface.

[0] [https://github.com/icodeforlove/task.js](https://github.com/icodeforlove/task.js)

