

C++11 async tutorial - octopus
http://solarianprogrammer.com/2012/10/17/cpp-11-async-tutorial/

======
malkia
The "fade", "grad" and "lerp" are written as member functions (so there is
some "this" pushing), yet there is no need for that. There might be some
overall speedup from inlining them.

The author claimed it's a straight Java translation of Ken Perlin's code, yet
he managed to drop the "static" on the above three functions.

EDIT: Some more problems.

Since Perlin noise is a perfect candidate for a data-parallel split, it would
be much more efficient to split the job differently: instead of creating one
image per thread, split each image among all the threads and process the
pieces individually.

This way memory usage is reduced - e.g. you are processing one image, not N.
For very big images it might be even more efficient to limit each thread to
16x16 or 256x1 image blocks.

There is another Java->C++ problem with the code in this case - since you
instantiate the p[] array (again, not static), each worker gets its own copy,
hence wasting L1 cache (the p[] data is all the same).

Overall this task is better suited for OpenMP (data-parallel) than the async
stuff (but thanks for demonstrating it, I was not familiar with it).

~~~
minimax
It looks to me like the lowest hanging fruit in terms of reducing the
program's run time is in the PPM output code.

[https://github.com/sol-prog/async_tutorial/blob/master/ppm.cpp#L104](https://github.com/sol-prog/async_tutorial/blob/master/ppm.cpp#L104)

Three calls to ostream::write() for each and every pixel. oof.

------
SoapSeller
"As a side note, the parallel version uses about 280 threads on my machine vs
a single thread for the serial version."

This is a great example of how you shouldn't run things in parallel. Running
280 threads on a 2-4 core machine makes no sense at all.

~~~
octopus
Actually it makes sense: if you manually force the code to use only 4 threads,
it will take more time to finish than in the case when you let the compiler
split the work for you.

On a given machine there are hundreds of threads running concurrently; the OS
will let them run in time slices on the available hardware. Using the same
number of threads as the number of available processors does not mean your
code will be faster than when you use a larger number of threads. Obviously
this is OS- and compiler-dependent.

~~~
exDM69
> Actually it makes sense: if you manually force the code to use only 4
> threads, it will take more time to finish than in the case when you let the
> compiler split the work for you.

While I think this is correct, I think your reasoning is off. And as far as I
know, the compiler does not do anything clever here (unlike in some smarter,
less mainstream languages) and you're just launching a huge number of threads.

If you use only 4 threads in a similar manner in a loop, the problem is that
the background threads finish and the CPU sits idle while the serial loop is
still busy creating and launching new threads, so there are never enough
threads to keep the CPU busy. (edit: this is more likely about waiting for
I/O, see below).

If you're truly CPU bound, you will not gain from having many more threads
than CPU cores (times ~1.5 for hyperthreads). If your threads are waiting for
I/O, adding more will cause the OS to switch to an available thread, but you
gain more by using asynchronous I/O functions (like aio and epoll or kqueue,
for which you won't find an abstraction in the C++ stdlib). The way the code
in the blog works, it actually spends a whole lot of good CPU cycles on
context switching from one thread to another and wastes a lot of memory
maintaining threads that wait for disk I/O.

What you can't see from the tutorial source is what the worker
(make_perlin_noise) actually does; I assume that function also does the
writing to disk, which would explain why you benefit from launching a huge
number of threads. On the other hand, memory use is completely unacceptable
with this solution.

You should try this: split the number of images evenly and make each thread
compute a number of images. Launch roughly as many threads as there are cpu
cores and do not make the threads wait for disk I/O. You should see
performance go up and memory use go down.

~~~
Nursie
One of the patterns I've used before is thread pools, rather than starting and
stopping threads all the time. Each thread in the pool can be passed a list of
the tasks it needs to do, or grab tasks from a central list as it finishes the
previous one.

Then after a few runs you can tune the pool size so that you get the most out
of the machine. Obviously there are still gains to be made by switching to
async I/O, avoiding any serial choke points, keeping the pool size small,
etc., but it seems to be a decent approach.

