
The Parallelism Blues: when faster code is slower - bibyte
https://pythonspeed.com/articles/parallelism-slower/
======
longemen3000
In Julia, where the parallelization options are explicit (SIMD, AVX, threads,
or multiprocessing), it always depends on the load: for small operations
(around 10,000 elements) a single thread is faster, purely because of the
thread-spawning time (around 1 microsecond). And there is the issue of the
independent BLAS threading model, where the BLAS threads sometimes interfere
with Julia threads... In a nutshell, parallelization is not a magic bullet,
but it is a good bullet to have at your disposal anyway
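
The same overhead is easy to see in Python, the article's language; here is a
rough single-file sketch, with numbers that are entirely machine-dependent:

    import threading
    import time

    def tiny_task():
        sum(range(1000))  # a deliberately small workload

    # Run the task inline on the current thread.
    start = time.perf_counter()
    tiny_task()
    inline_s = time.perf_counter() - start

    # Run it on a freshly spawned thread, paying spawn + join overhead.
    start = time.perf_counter()
    t = threading.Thread(target=tiny_task)
    t.start()
    t.join()
    threaded_s = time.perf_counter() - start

    print(f"inline: {inline_s * 1e6:.0f} us, "
          f"threaded: {threaded_s * 1e6:.0f} us")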

~~~
The_rationalist
Do you know if Julia will add OpenMP support? It's clearly the way to go for
offloading to hardware in a productive way.

~~~
dagw
Julia actually initially had OpenMP-backed parallelism
(ParallelAccelerator.jl), but they're moving away from OpenMP towards a novel,
native task-parallelism framework inspired more by things like Cilk [0].

[0]
[https://julialang.org/blog/2019/07/multithreading/](https://julialang.org/blog/2019/07/multithreading/)

------
AzN1337c0d3r
There's also the problem of Turbo Boost.

My laptop's 9980HK will boost to ~4.5 GHz when only a single core is loaded.

However, when I load up all 8 cores, it might only sustain ~3.5 GHz.

Therefore, loading all 8 cores might not actually complete the work 8 times
as fast, but only about 6.2x (8 * 3.5/4.5), due to the lowered clock rate of
each individual core.

This will show up as additional user time, since each individual core is able
to do less work for each unit of time (seconds) compared to the single-core
case.
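
Putting numbers on both effects (a sketch using the clocks quoted above; real
chips vary with thermals and workload):

    # Clock figures quoted above; both are machine- and workload-dependent.
    single_core_ghz = 4.5  # boost clock with one core loaded
    all_core_ghz = 3.5     # sustained clock with all 8 cores loaded
    cores = 8

    wall_speedup = cores * all_core_ghz / single_core_ghz
    user_inflation = single_core_ghz / all_core_ghz  # CPU-seconds vs. one core

    print(f"wall-clock speedup: {wall_speedup:.1f}x")     # ~6.2x, not 8x
    print(f"user-time inflation: {user_inflation:.2f}x")  # ~1.29x more CPU time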

------
maweki
"It would be extremely surprising, then, if running with N threads actually
gave ×N performance."

Basically impossible by Amdahl's law.
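
For reference, Amdahl's law caps the speedup on N workers at
1 / ((1 - p) + p/N), where p is the parallelizable fraction of the work. A
tiny sketch, with p = 0.95 as a purely illustrative value:

    # Amdahl's law: best-case speedup on n workers when a fraction p of
    # the work is parallelizable (p = 0.95 is illustrative).
    def amdahl_speedup(n, p):
        return 1.0 / ((1.0 - p) + p / n)

    print(f"{amdahl_speedup(8, 0.95):.1f}x")  # ~5.9x on 8 workers, not 8x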

~~~
twoodfin
Of course in the general case Amdahl's law is inescapable, but some tasks on
modern systems can show > ×N speedup over single-threaded performance if, for
example, a single thread can only exploit at maximum some fraction of the
total memory bandwidth or some level of the cache hierarchy.

~~~
jfim
Do you have examples where running n threads instead of 1 results in a speedup
greater than n?

The only thing I can think of would be that the additional threads would kick
the CPU into using a higher frequency, but a single thread using 100% of the
CPU should already do that.

~~~
CUViper
There are some broad examples of that effect here:
[https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup](https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup)

~~~
saltcured
The caching effects mentioned there have to do with adding additional hardware
resources that change the effective cache sizes and shift the working set into
higher performing storage or memory.

You can also encounter a funny software engineering effect. Refactoring an
array-based algorithm to work in parallel often involves the introduction of
block-decomposition. This change allows sub-problems to be spread to different
workers. Sometimes, this change will also accelerate the sequential code, as
the new block-oriented access patterns improve cache efficiency!
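
A minimal single-threaded numpy sketch of that effect: the tiled transpose
below often beats the row-at-a-time version purely through better cache
behaviour (the block size, and whether you see a win at all, are
machine-dependent):

    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n)
    out = np.empty_like(a)

    def transpose_rowwise(a, out):
        # Each iteration writes one strided column: poor cache locality.
        for i in range(a.shape[0]):
            out[:, i] = a[i, :]

    def transpose_blocked(a, out, block=64):
        # Copy square tiles so reads and writes both stay cache-resident.
        for i in range(0, a.shape[0], block):
            for j in range(0, a.shape[1], block):
                out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T

    for fn in (transpose_rowwise, transpose_blocked):
        start = time.perf_counter()
        fn(a, out)
        print(fn.__name__, f"{time.perf_counter() - start:.3f}s")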

------
_bxg1
None of this is surprising, right? Unless your system has fewer threads than
cores (which it probably doesn't even without your program) there will always
be some context-switching overhead. It's worth keeping in mind I guess -
especially the fact that numpy parallelizes transparently - but generally
these results are to be expected.

The title is also misleading; it suggests that the _wall clock_ time might be
longer for parallel code in certain cases. While not impossible, that isn't
what the article covers.

~~~
winkeltripel
> the wall clock time might be longer for parallel code

That is exactly the case if the CPU is the bottleneck in your already-parallel
application. It's a case where we really shouldn't be layering different
parallel bits on top of each other in one codebase, but might be doing it
naively.
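
One common mitigation (a sketch, not from the article): pin the inner thread
pools to one thread so the outer layer is the only parallelism in play. The
environment variables below cover the usual BLAS/OpenMP backends and must be
set before numpy is imported:

    import os

    # Pin the BLAS/OpenMP pools before importing numpy, so the process
    # pool below is the only source of parallelism.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"

    import numpy as np
    from multiprocessing import Pool

    def work(seed):
        rng = np.random.default_rng(seed)
        m = rng.random((1000, 1000))
        return (m @ m).sum()  # the matmul now runs single-threaded per worker

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(sum(pool.map(work, range(8))))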

------
ncmncm
The article uses the term "parallelism" when it is talking, instead, about
concurrency.

Parallelism is specifically the work that happens completely independently on
all processing units, the part that actually goes Nx as fast on N units
(clock depression aside). Concurrency refers to the overhead of coordinating
the activity of those units, which keeps you from getting your full Nx. It is
overhead on top of any genuinely serial parts of the computation, which is
what Amdahl's law addresses.

In other words: Parallelism giveth, and concurrency taketh away.

The distinction gets more useful the more you think about the subject.

------
yaroslavvb
On a 64-core Xeon E5, this example gives a 1.8x increase in user time but an
8x decrease in wall-clock time.

------
wtracy
This is perfectly normal behavior when Intel Hyperthreading is involved.

I'm on my phone, so rather than trying to type out an explanation, I'm going
to link to Wikipedia:
[https://en.wikipedia.org/wiki/Hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading)

~~~
itamarst
The benchmarks were run on a computer with hyperthreading disabled.

