
Simplifying Parallel Applications for C++: An Example using RaftLib (2016) - bob_rad
https://medium.com/cat-dev-urandom/simplifying-parallel-applications-for-c-an-example-parallel-bzip2-using-raftlib-with-performance-f69cc8f7f962#.ax1lzrq1n
======
malcolmgreaves
Great presentation of the problem, code to solve it, and performance analysis.
A super enjoyable, not-lengthy-at-all read! Certainly makes me more interested
in RaftLib (even though I don't do a lot of C++ programming!).

------
zvrba
So why would I want to use a relatively unknown library still in alpha instead
of Intel TBB, which also supports dataflow graphs?

~~~
vmarsy
> So why would I want to use a relatively unknown library still in alpha
> instead of Intel TBB, which also supports dataflow graphs?

Not you specifically, since you probably don't run applications on
supercomputers with millions or even just hundreds of processors, but for those
interested in scaling beyond multi-threaded programs with Intel Threading
Building Blocks, it could be an interesting library.

~~~
jcbeard
Hi! I'm the primary author and maintainer of the library. Thanks for the
interjection. The intended application for RaftLib is to make something that
will scale, just as you mention. I wrote this post a long time ago when I was
just trying to get people interested in using it. It's a simple example that
shows you can take many lines of standard parallel code and, in a very short
time, write a much easier-to-read (and smaller) version that performs just as
well as or better than the manually managed parallel code.

A long time ago I was a biologist, then a bioinformaticist. I wrote some code
that would scale to a single node and a few dozen cores. In doing so, I
realized how much I hated writing the same boilerplate code over and over
again. TBB, C++11 threads, OpenMP, and MPI all have basically the same level
of boilerplate and gotchas. I wanted to make something that was relatively
easy to use and easy to integrate with C/C++ code. Go was the only thing that
came close, but it was brand new at the time.

It occurred to me while working on the AutoPipe system as a grad student that
I could do something even better than a simple coordination language and at
the same time subsume the functionality of a lot of parallel libraries. With
stream/data-flow processing I can do the exact same things I can do with
OpenMP and MPI, and more besides. The state encapsulation enables a whole host
of cool optimizations, like identifying bottlenecks and duplicating actors
dynamically (things that are limited in OpenMP or C++11 threads for many
reasons). You can also compile an encapsulated function for another hardware
platform entirely, or use high-level synthesis tools to target an FPGA (I'll be
going there again soon with RaftLib). The only thing that has to stay
constant across these optimizations is the connectivity of the DAG. By
maintaining a port interface, just as you would with hardware components (see
Arvind's work at MIT... he's famous enough that I just have to say Arvind :),
we can compose really complicated applications. The port interface, it turns
out, is also perfect for distributed compute.
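
To make the port idea concrete, here's roughly what a kernel with named input
and output ports looks like (a simplified sketch in the spirit of the examples
in the repo; the exact details may differ from the current API):

    #include <raft>
    #include <string>
    #include <cstddef>

    /** sketch: a kernel that reads strings on one port and writes their lengths on another **/
    class length_kernel : public raft::kernel
    {
    public:
        length_kernel() : raft::kernel()
        {
            input.addPort< std::string >( "in" );
            output.addPort< std::size_t >( "out" );
        }

        virtual raft::kstatus run()
        {
            std::string s;
            input[ "in" ].pop( s );             /** consume one item from the inbound FIFO **/
            output[ "out" ].push( s.size() );   /** produce one item on the outbound FIFO  **/
            return( raft::proceed );            /** tell the runtime to keep scheduling us **/
        }
    };

The runtime only ever sees the ports and the connectivity between them, which
is exactly what lets it move, duplicate, or re-target a kernel without touching
the code inside it.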

A while back, I also had the realization that iostreams were perfect for this
paradigm. Once you get your head around the concept, it seems quite natural.
If it doesn't take off as a library, oh well. I enjoy working on it and using
it, so I'll likely keep developing it in my spare time.
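
To show the iostream feel, this is approximately how a tiny graph gets wired up
and run; it's a sketch along the lines of the hello-world example, so take the
exact names with a grain of salt:

    #include <raft>
    #include <raftio>
    #include <string>
    #include <cstdlib>

    /** sketch: a kernel that emits one message and stops **/
    class hello : public raft::kernel
    {
    public:
        hello() : raft::kernel()
        {
            output.addPort< std::string >( "0" );
        }
        virtual raft::kstatus run()
        {
            output[ "0" ].push( std::string( "Hello World\n" ) );
            return( raft::stop );   /** one message, then stop **/
        }
    };

    int main()
    {
        hello h;
        raft::print< std::string > p;   /** built-in print kernel from <raftio> **/
        raft::map m;
        m += h >> p;    /** '>>' links h's output to p's input, iostream style **/
        m.exe();        /** run the graph; the kernels execute concurrently    **/
        return( EXIT_SUCCESS );
    }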

In the interim, I'll get back to exascale hardware stuffs :).

~~~
Drdrdrq
If I read your charts right, your app's single-core performance was much
better than pbzip2's, which is quite surprising. I thought these apps were
heavily optimized... Any comment?

~~~
jcbeard
Yup, it was quite a bit better, especially on the upper end. Looking at the
snoops on the bus using PAPI, RaftLib does a better job of keeping cache
lines from bouncing.

The benchmarked version also has a dynamically resizing FIFO that uses the
utilization of the queue itself to guide sizing. This means the FIFO can
better adapt to the dynamic behavior found in most applications run on top of
an operating system (which is most of them these days, outside of HPC).
Looking at load stalls, the RaftLib version has fewer, but not quite enough
fewer to account for the results.
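
To give a rough feel for the idea (purely illustrative, not the library's
actual mechanism), a utilization-guided resizer boils down to something like:

    #include <cstddef>

    /** toy utilization-guided resizer: grow when the queue runs nearly full,
        shrink when it sits nearly empty, sampled periodically by a monitor **/
    struct fifo_resizer
    {
        std::size_t capacity   = 64;
        double      high_water = 0.9;   /** occupancy ratio that triggers growth **/
        double      low_water  = 0.1;   /** occupancy ratio that triggers shrink **/

        std::size_t next_capacity( const std::size_t occupied ) const
        {
            const double util = static_cast< double >( occupied ) / capacity;
            if( util > high_water )
            {
                return( capacity * 2 );     /** producer is outrunning the consumer **/
            }
            if( util < low_water && capacity > 64 )
            {
                return( capacity / 2 );     /** queue is mostly idle, reclaim memory **/
            }
            return( capacity );             /** utilization looks healthy, keep size **/
        }
    };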

If you look at the single worker thread case, then jump to two threads... you
can see a fairly big jump. RaftLib is, by definition, a pipelined programming
system. The file read and compress stages run perfectly in parallel. The bzip2
code doesn't quite pull it off in a perfectly pipelined fashion. It's close,
but not quite, which results in less overlap of execution and communication.
If I'd run on Linux (thread affinity on OS X was, well, fun last time I
checked... if not impossible to set manually), I'd also add thread affinity to
the list, which most people don't bother to optimize. Hot caches and
synergistic cache accesses are quite beneficial.
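
For what it's worth, pinning a stage's thread on Linux is only a few lines;
here's a minimal sketch using pthread_setaffinity_np (the function name and
core choice below are just for illustration):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <iostream>

    /** pin a std::thread to a single core so its working set stays in that core's cache **/
    void pin_to_core( std::thread &t, const int core )
    {
        cpu_set_t set;
        CPU_ZERO( &set );
        CPU_SET( core, &set );
        if( pthread_setaffinity_np( t.native_handle(), sizeof( set ), &set ) != 0 )
        {
            std::cerr << "failed to pin thread to core " << core << "\n";
        }
    }

    int main()
    {
        std::thread worker( [](){ /** compress-stage work would go here **/ } );
        pin_to_core( worker, 1 );   /** keep this stage hot on core 1 **/
        worker.join();
        return( 0 );
    }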

------
frostirosti
Good luck, my friends. Many have fallen at this pillar of parallelism before
you.

~~~
jcbeard
Thanks! And meh.

~~~
je42
Hard problems can be very exciting to solve!

Never give up until you yourself feel that advancing isn't going to work!

Looking forward to the beta and the full release of your lib! The C++
community needs this!

