

Make your program slower with threads - bbgm
http://brooker.co.za/blog/2014/12/06/random.html

======
ScottBurson
Uh, you can't just drop in 'random_r' in place of 'random'. If you do that,
what will happen is that you'll get _the same sequence of random numbers in
every thread_, making the whole multithreading exercise useless!

You have to be sure to seed the generator differently in each thread, before
calling 'random_r'.
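
Roughly like this (a minimal sketch assuming glibc; the seed scheme, sizes
and helper names are just illustrative, and error handling is omitted):

    /* Sketch (untested): one random_r state per thread, each seeded
     * differently. Compile with: gcc -O2 -pthread sketch.c */
    #define _DEFAULT_SOURCE
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    struct thread_rng {
        struct random_data data;  /* glibc's reentrant PRNG state */
        char state[64];           /* backing buffer for the generator */
    };

    static void *worker(void *arg)
    {
        /* distinct seed per thread: time mixed with the thread's index */
        unsigned seed = (unsigned)time(NULL) ^ (unsigned)(uintptr_t)arg;
        struct thread_rng rng;
        memset(&rng, 0, sizeof rng);  /* random_data must start zeroed */
        initstate_r(seed, rng.state, sizeof rng.state, &rng.data);

        long heads = 0;
        for (long i = 0; i < 10000000; i++) {
            int32_t r;
            random_r(&rng.data, &r);  /* no lock needed: state is private */
            heads += r & 1;
        }
        return (void *)heads;
    }

    int main(void)
    {
        pthread_t tid[4];
        for (long t = 0; t < 4; t++)
            pthread_create(&tid[t], NULL, worker, (void *)(t + 1));
        for (long t = 0; t < 4; t++) {
            void *heads;
            pthread_join(tid[t], &heads);
            printf("thread %ld: %ld heads\n", t, (long)heads);
        }
        return 0;
    }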

~~~
frozenport
Moreover, he probably improved the single-threaded version, and should re-
check the scaling curve. There should be data points for `1 with random_r`,
`2 with random_r`, and `3 with random_r`.

~~~
onestone
It's unlikely the single-threaded version was improved noticeably. Futexes are
nearly zero-cost when not contended.

~~~
mjb
I left the random_r numbers with single threads out of the post for exactly
this reason. Performance with random_r on a single thread was less than 3%
faster, with a slightly higher variance (which I can't explain, to be honest).

------
userbinator
It should be kept in mind that standard memory (e.g. DDR3) doesn't actually
support concurrent accesses (and at least on x86, one or two cores at most can
easily saturate the full bandwidth), so any time you have shared data accessed
by multiple threads, those accesses are effectively serialised and that part
performs no better than the single-threaded case. The biggest speedup occurs
when, for the majority of the time, the threads are running in their own
caches and not sharing any data.
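
As a hedged illustration of that last point (names are made up; assumes
64-byte cache lines), keep each thread's working data on its own cache line
and only touch shared memory in a final combine step:

    /* Sketch: per-thread accumulators padded to a cache line so threads
     * never touch each other's data until the final reduction. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    100000000L

    struct padded_counter {
        _Alignas(64) long value;  /* assumes 64-byte cache lines */
    };

    static struct padded_counter counts[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counts[id].value++;   /* private line: no coherence traffic */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        long total = 0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += counts[t].value;  /* the only shared access, done once */
        }
        printf("total = %ld\n", total);
        return 0;
    }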

~~~
zurn
While you are kind of correct in that a memory channel doesn't do multiple
concurrent transfers, single-threaded memory-bound workloads definitely leave
significant main memory performance on the table in most cases. (And if the
workload isn't memory-bound, it leaves even more on the table.)

Main memory is high-latency, and single threads usually don't saturate the
bandwidth because they are not issuing back-to-back pipelined requests to
memory. The CPU has a single shared memory controller that sees requests from
all cores/SMTs. A desktop x86 might support 16 outstanding cache misses, which
can come from different cores or SMT "threads", and the controller is designed
to keep the memory channels busy and the request pipelines full by scheduling
in work from all the cores/SMTs, prefetching according to the big picture, etc.

Even multiple threads accessing the same data don't actually look different
from the DIMM's point of view: all the x86 synchronization takes place inside
the CPU caches and is invisible in the main memory traffic/requests. (This is
true even on multi-socket systems; the caches just talk to each other over an
inter-CPU bus.)

Also, desktop/server systems typically have more than one independent memory
channel; beefy servers have a whole lot of them.

~~~
adwn
> _While you are kind of correct in that a memory channel doesn't do multiple
> concurrent transfers, single-threaded memory-bound workloads definitely
> leave significant main memory performance on the table in most cases._

You have to distinguish between memory-_latency_-bound and memory-
_throughput_-bound. If your program has a simple, regular access pattern
(which the CPU's prefetcher recognizes), then it's not that hard to saturate
the memory bandwidth of a typical two-channel, DDR3-1600 system (25.6 GB/s
peak throughput) with a single thread.
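
(For reference, that peak figure is just the channel arithmetic: 1600 MT/s ×
8 bytes per transfer = 12.8 GB/s per channel, times two channels = 25.6 GB/s.)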

~~~
vardump
In order to get maximum throughput, you also might need to do SIMD loads (and
stores).

On Haswell this means two AVX loads, 64 bytes per clock cycle (two 32-byte
loads). I think with scalar code you get only 16 bytes per cycle (two 8-byte
loads), and 32 bytes per cycle with SSE.

So a single Haswell core at 3 GHz can load about 178 GB/s!

On Ivy Bridge, SSE is enough to saturate the memory bus.

Regardless, that's quite a bit more than what the lower cache levels and DRAM
can supply. It's not very hard to saturate the whole CPU-local memory bus with
just one core.
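
For concreteness, a sketch of what "two AVX loads per cycle" looks like in
code (assumes AVX2 and 32-byte-aligned data; the function name is illustrative
and it ignores 32-bit overflow; compile with -mavx2):

    /* Streaming sum issuing two 32-byte loads per iteration so both load
     * ports can be busy every cycle (best case, data already in L1). */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    int64_t sum_avx2(const int32_t *buf, size_t n)  /* n a multiple of 16 */
    {
        __m256i acc0 = _mm256_setzero_si256();
        __m256i acc1 = _mm256_setzero_si256();
        for (size_t i = 0; i < n; i += 16) {
            __m256i a = _mm256_load_si256((const __m256i *)(buf + i));      /* 32 bytes */
            __m256i b = _mm256_load_si256((const __m256i *)(buf + i + 8));  /* 32 bytes */
            acc0 = _mm256_add_epi32(acc0, a);
            acc1 = _mm256_add_epi32(acc1, b);
        }
        /* horizontal reduction, off the hot path */
        int32_t lanes[16];
        _mm256_storeu_si256((__m256i *)lanes, acc0);
        _mm256_storeu_si256((__m256i *)(lanes + 8), acc1);
        int64_t sum = 0;
        for (int i = 0; i < 16; i++)
            sum += lanes[i];
        return sum;
    }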

~~~
nkurz
_So a single Haswell core at 3 GHz can load about 178 GB/s! On Ivy Bridge,
SSE is enough to saturate the memory bus._

I'd claim instead that it's surprisingly hard to saturate the memory bus with
a single core[1]. Presuming your math is right, you've given the rate at which
Haswell can load from L1 to registers if we can ignore latency and
concurrency. That is, the CPU can sustain this number only if the latency is
low enough relative to the number of "requests in flight". In practice,
latency and the limited number of requests in flight prevent a single core
from saturating the memory bus except in the case of dual-channel memory,
perfect prediction and prefetching, and no TLB misses.

More commonly, the latency of the read from memory combined with the limited
number (10) of "line fill buffers" reduces the throughput from RAM. The
formula for actual bandwidth is known as "Little's Law": Concurrency = Latency
* Bandwidth. Considering its fundamental importance in gauging actual
performance, it's strikingly rare to see it discussed in CS theory. John
McCalpin, the author of the STREAM benchmark, has an excellent explanation of
how it applies to memory bandwidth:
http://sites.utexas.edu/jdm4372/2010/11/03/optimizing-amd-opteron-memory-bandwidth-part-1-single-thread-read-only/
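
To put illustrative numbers on that (rough, assumed figures, not
measurements): 10 line fill buffers × 64-byte cache lines means at most 640
bytes in flight per core; at roughly 70 ns of memory latency, Little's Law
caps a single core at about 640 B / 70 ns ≈ 9 GB/s from RAM, far below what
the load ports could theoretically move, no matter how wide the loads are.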

[1] Here's my write-up of trying to do so on Sandy Bridge:
https://software.intel.com/en-us/forums/topic/480004

~~~
vardump
Your write-up was pretty interesting. Especially the instruction mixes;
LFENCEs of all things helped?! Thanks.

I haven't tried to optimize fill/copy rates since the late nineties. Getting
the last 10-20% appears to be rather tricky. Luckily, maximum bandwidth per
core is seldom an issue.

Sometimes I've thought about designs where one core does memory prefetching
ahead of a loosely synchronized second core that does the actual computation,
to get maximal serial computational throughput. The idea is that the
computational core would always be fetching its data from the shared L2/L3,
for those times when the access patterns are not predictable by the hardware
prefetcher, etc.

~~~
nkurz
_LFENCEs of all things helped_

I don't really understand what was happening here, but it was a definite
positive effect. I think the issue is that the early preloads filled the
queue, preventing the stores from happening. The bright side is that getting
good performance on Haswell is easier, as it supports 2 loads and 1 store per
cycle (although note that address generation is still a bottleneck unless you
make sure that the store is using a "simple" address).

 _I've thought about designs where one core does memory prefetching ahead of
a loosely synchronized second core that does the actual computation_

I'm interested in this approach too, but haven't really tried it out. "srean"
suggested this also:
https://news.ycombinator.com/item?id=8711626

I think it would work well if you know your access patterns in advance; the
hard part would be keeping the cores in sync. In the absence of real-time
synchronization, you'd need a really high-throughput queue to make this work.
Interesting write-up and links here:
http://www.infoq.com/articles/High-Performance-Java-Inter-Thread-Communications
It's a much easier problem if you can batch up the preloads rather than doing
them one at a time.

I have had success with a somewhat similar technique within a single core:
issue a batch of prefetches, then do processing on the previous data, then do
a batch of loads corresponding to the prefetches, then repeat. You can get a
2-3x improvement in throughput on random access if you are able to tolerate
the slightly increased latency. The key is figuring out how to keep the
maximum number of requests in flight at all times.
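
A rough sketch of that pattern (the function name and batch size are
illustrative; assumes x86's _mm_prefetch intrinsic):

    /* Prefetch the next batch of random indices, process the current batch,
     * then swap, so the misses for batch k+1 overlap the work on batch k. */
    #include <xmmintrin.h>   /* _mm_prefetch */
    #include <stddef.h>
    #include <stdint.h>

    #define BATCH 16

    int64_t gather_sum(const int64_t *table, const uint32_t *idx, size_t n)
    {
        int64_t sum = 0;
        /* prime the pipeline: prefetch the first batch */
        for (size_t j = 0; j < BATCH && j < n; j++)
            _mm_prefetch((const char *)&table[idx[j]], _MM_HINT_T0);

        for (size_t i = 0; i < n; i += BATCH) {
            /* prefetch the *next* batch so its misses overlap current work */
            for (size_t j = i + BATCH; j < i + 2 * BATCH && j < n; j++)
                _mm_prefetch((const char *)&table[idx[j]], _MM_HINT_T0);
            /* process the current batch; loads should now be (mostly) hits */
            for (size_t j = i; j < i + BATCH && j < n; j++)
                sum += table[idx[j]];
        }
        return sum;
    }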

------
ggreer
I have some experience in this domain and I agree with the author: threads are
not a performance cure-all. Many naïve implementations will actually hurt
performance[1]. It takes careful thought, experimentation, measurement, and
quite a bit of elbow grease to get the expected speedup.

The more general lesson is this: Just because you have very good reasons to
believe a change will improve performance, that doesn't mean it will actually
improve performance! Benchmark. Profile. Fix the bottleneck. Repeat. That's
the only reliable way to make something faster.

1. http://geoff.greer.fm/2012/09/07/the-silver-searcher-adding-pthreads/

~~~
JoeAltmaier
We're in a strange place with optimization. Hardware folks build very complex
memory access paths (multiple caches/multiple accessors) to try and speed up
the average case of unoptimized code. Then we try to trick that mechanism into
performing for our particular computations. Maybe some programmable
architecture would help here? Maybe not.

~~~
hollerith
IOW, programmers assume hardware is simpler than it really is whereas hardware
designers assume software is simpler than it really is.

------
Animats
_" It's little more than a tight loop around a pseudo-random number
generator."_ One locked against concurrent access.

Yes, put a lock and context switch in a tight inner loop, and your performance
will suck.

~~~
ProCynic
I think the point is that he didn't know there was a lock. His closing
argument seems to be that when you're writing multi-threaded code, you have to
think not only about what your own code is doing, but also about what all the
library and system calls you use are doing. Which is one of the many reasons
multi-threaded code is harder to write than single-threaded code.

~~~
blt
Yes, I thought rand()'s self-locking was pretty common knowledge, but it is a
good example to teach the perils of multithreading, because its signature
gives no indication that anything bad will happen. strtok() is even worse.

C++11's excellent <random> library represents the PRNG as an object, as it
should. Besides making the statefulness of PRNGs explicit, its API encourages
correct usage - it would be _more_ work to share a PRNG between threads.

~~~
Animats
_strtok() is even worse._ (Because it has global state)

There are a lot of C standard library functions from the 1970s that should
have been moved to a "deprecated" library around 1990 or so and phased out by
now.

~~~
vardump
Throw zero-terminated strings away too, please. Scanning the bytes one by one
just to find the end of a string is very slow and has caused so many bugs.

------
learnstats2
The author writes that this is a contrived case but I was doing Monte Carlo
simulations 10 years ago, i.e. I wanted _exactly_ this case.

In the end, I just let them run on one thread and could never work out how
they could be 10x slower with threading - until today. Thanks.

------
pjbringer
This exact problem had me worried when I heard about OpenBSD's arc4random, and
how they've been promoting pervasive use of it. I haven't taken the time to
look at how they solve the problem of contention while still maximizing the
unpredictability of the random number generator's state by using it.

~~~
anon4
Aren't you supposed to use arc4random once, to seed a high-quality
pseudorandom generator? From what I understand, seeding a high-quality PRNG
with a good source of randomness is really all you need.

~~~
makomk
arc4random _is_ their high-quality pseudorandom generator that's seeded from a
good source of randomness (namely, the getentropy system call).

~~~
thirsteh
Yes, but like urandom the general intent (at least when performance is
desired) is you use it to seed your own PRNG and avoid making system calls
every time you need random data.
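
A minimal sketch of that pattern (illustrative only: it reads one seed from
/dev/urandom and then runs splitmix64 in userspace, which is a plain
statistical PRNG and not a substitute for arc4random when unpredictability
matters):

    /* Seed once from the OS, then generate with no further system calls. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t state;

    static int seed_from_os(void)
    {
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f)
            return -1;
        size_t got = fread(&state, sizeof state, 1, f);
        fclose(f);
        return got == 1 ? 0 : -1;
    }

    static uint64_t next_rand(void)   /* one splitmix64 step */
    {
        uint64_t z = (state += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }

    int main(void)
    {
        if (seed_from_os() != 0)
            return 1;
        for (int i = 0; i < 5; i++)
            printf("%016llx\n", (unsigned long long)next_rand());
        return 0;
    }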

------
fche
The author's multithreaded random_r version has the benefit of performance,
but has the problem of, well, brokenness. Without locking the internal state
of random_r(), it will be corrupted, and the results won't be meaningful.

~~~
mjb
Can you explain more? My understanding (which seems to be backed up by the
implementation) is that I can call initstate_r in each thread to initialize
thread-local state, then call random_r without synchronization from each
thread as long as I used that thread's thread-local state.

random_r's internal state is all, from what I can see of the implementation,
entirely passed in by the caller. Implementation here:
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/random_r.c;h=87cfdc285c0f6e4a4911eac0f6dd5efa2c3fb8b6;hb=9752c3cdbce2b3b8338abf09c8b9dd9e78908b8a

~~~
ScottBurson
You are correct, and 'fche is mistaken. Each 'random_data' object is accessed
only from a single thread.

------
plg
What about using OpenMP?

My recollection is that I wrote a similar program (statistical bootstrapping,
essentially a loop around random()), and using OpenMP on a 4-core machine
definitely produced a speedup (not 4x, but close).

Does OpenMP somehow sidestep this issue of shared access? (Or is my memory
wrong?)

~~~
nkurz
My first thought was that this was a silly question, that there is nothing
magic about OpenMP, and that the throughput would be the same. But thinking
more, this would be a good thing to benchmark.

The difference is that OpenMP will be using multiple processes instead of
multiple threads. Since the bottleneck in this case is access to memory in
common between the threads, and since different processes won't be sharing
this memory, I think OpenMP would indeed have a 4x speedup on this problem on
a 4 (physical) core machine if the runtime was long enough to offset the cost
of launching the processes with OpenMP.

More generally, it would be worthwhile to benchmark the performance difference
between multiple processes created with fork() versus multiple threads created
with pthread_create(). If there is no need to write to the same memory (as
with a bootstrap), the shared-nothing, process-based approach is more likely
to get a linear speedup and is usually easier to reason about.
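
A hedged sketch of the shared-nothing variant (error handling omitted; it
relies on small pipe writes being atomic, and the constants are illustrative):

    /* One process per core, each with its own PRNG state and its own copy
     * of random()'s lock, so there is nothing to contend on. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROC 4
    #define ITERS 10000000L

    int main(void)
    {
        int fd[2];
        pipe(fd);
        for (int p = 0; p < NPROC; p++) {
            if (fork() == 0) {            /* child: completely private memory */
                srandom(getpid());        /* per-process seed */
                long heads = 0;
                for (long i = 0; i < ITERS; i++)
                    heads += random() & 1;
                write(fd[1], &heads, sizeof heads);  /* 8-byte atomic write */
                _exit(0);
            }
        }
        long total = 0;
        for (int p = 0; p < NPROC; p++) {
            long heads;
            read(fd[0], &heads, sizeof heads);
            total += heads;
            wait(NULL);
        }
        printf("total heads: %ld\n", total);
        return 0;
    }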

------
jokoon
I don't really see the point of using multithreading for speedup if you have
fewer than 10 cores. That example program might run faster with OpenCL.

