
Parallelism as a First Class Citizen in C and C++ - fogus
http://software.intel.com/en-us/blogs/2011/08/09/parallelism-as-a-first-class-citizen-in-c-and-c-the-time-has-come/
======
cabacon
He's totally right, but people have wanted this, and tried to implement it,
several times. There's UPC (<http://upc.lbl.gov/>), Co-Array Fortran
(<http://www.co-array.org/>), OpenMP (<http://openmp.org/wp/>), TBB
(<http://threadingbuildingblocks.org/>), Cilk from MIT per scott_s, then there
are the CUDA/OpenCL accelerator extensions ...

We don't need a call to arms without a pretty good idea of how to do it, and
why it is different/better than the existing shots at parallelism. Jamming it
into C/C++ is one idea, and making a new language like Fortress
(<http://en.wikipedia.org/wiki/Fortress_(programming_language)> ) is another.

And those are all just the languages / language extensions. There's also the
message-passing (MPI, PVM) vs. remote put-get (ARMCI/Global Arrays/...)
divide. It's clearly a hot topic with how multi-core chips are coming along. I
didn't see much here that adds to the existing attempts, other than perhaps
bemoaning that they are currently proprietary. That seems natural, though.
With so many competing ideas, you see which one gains traction first, then
work on incorporating it into the standards.

~~~
scott_s
_With so many competing ideas, you see which one gains traction first, then
work on incorporating it into the standards._

But that's what he's promoting. Cilk has been around for a long time, since
before the current multicore era.

~~~
cabacon
I'd argue that compared to OpenMP and CUDA, Cilk has very little traction. My
frame of reference is the current set of HPC platforms, though. We had one
customer who wanted to build Cilk, and it was really just for R&D, not
production.

~~~
scott_s
I don't consider CUDA in the mix because it's designed specifically for GPUs.
But, yes, OpenMP has much more traction in the HPC community, because it was
designed by and for them. Its task parallelism, though, is rather ugly. I
don't know what Cilk's data parallel abstractions look like, but I suspect
they're better than OpenMP's task parallel abstractions. (Just because, well,
I think OpenMP's are that bad.)
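
For concreteness, here's roughly what the classic fib example looks like with
OpenMP 3.0 tasks (a sketch from memory, not from the article):

      int fib(int n) {
          if (n < 2) return n;
          int x, y;
          #pragma omp task shared(x)   // child task computes x
          x = fib(n - 1);
          #pragma omp task shared(y)   // child task computes y
          y = fib(n - 2);
          #pragma omp taskwait         // wait for both child tasks
          return x + y;
      }

      // Callers need extra boilerplate to create the thread team:
      // #pragma omp parallel
      // #pragma omp single
      // result = fib(30);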

But, fundamentally, OpenMP is _not_ integrated into the language. It's tacked
onto the language through pragmas. I think that was a hack, not a long-term
solution. And I say this as someone who did the exact same hack:
<http://people.cs.vt.edu/~scschnei/papers/scott_dissertation.pdf>

~~~
cabacon
Agreed re: pragmas as a hack, but that's just the kind of thing you'd expect
to gain traction first, before being folded into a language standard.

And re: data parallelism, I don't see that anyone has made a terribly popular
implementation. The niche languages like UPC, CAF, and HPF all seem to have
withered on the vine. So far, the only thing people seem to buy into is that
OpenMP-based task parallelism is easier than managing threads by yourself.

------
scott_s
Integrating parallelism into a language is an easy sell for me. And I like
his points, but the biggest news to me is that they are integrating Cilk Plus
into g++: <http://software.intel.com/en-us/articles/intel-cilk-plus/> At
first I thought they were open-sourcing the current Cilk implementation that
is a part of Intel's C/C++ compiler, but I think that is still proprietary.

Intel, as a company, still has a mixed message when it comes to shared-memory
parallel programming, as evidenced by their Parallel Building Blocks:
<http://software.intel.com/en-us/articles/intel-parallel-building-blocks/>

Threading Building Blocks was an internal thing - which it appears the author
was a part of. He literally wrote a book on it:
<http://www.amazon.com/Intel-Threading-Building-Blocks-Parallelism/dp/0596514808>
The solution he's championing is from Cilk Arts, which Intel purchased back
in 2008. But this article makes no mention of Array Building Blocks, which is
the rebranding of RapidMind, which Intel also purchased in either 2008 or
2009.

If you want to read papers on multithreaded programming that were ahead of
their time, read about the Cilk project back when it was pure research,
before it was spun off into a company which Intel bought. Google Scholar can
help: <http://scholar.google.com/scholar?q=cilk> "The implementation of the
Cilk-5 multithreaded language" is a particularly good paper.

~~~
jamesreinders
Good questions... We _are_ open sourcing the entire runtime we use, and
contributing to open source. We have created an open source runtime project
for use by any compiler, including gcc, and we will be using it ourselves (or
we already do).

TBB is the most widely used abstraction (not OS threads like pthreads) for C++
parallelism (several developer surveys confirm this). OpenMP is used by fewer
developers in the U.S. surveys, but OpenMP and TBB don't really compete for
developers because TBB is _very_ C++, and OpenMP is not. TBB was contributed
to open source in 2007 by Intel, and is a very active project - with lots of
users and ports virtually everywhere. Users include well-known names like
DreamWorks, Adobe, Autodesk, and EA in their key applications. It's new enough
(5 years) that many people have not yet heard of it... but it has a very
substantial user base.

I'm biased a bit - but I wrote the book about TBB after learning of the
project and loving it, not because I was on the project.

We believe in "library first," then (when you know what you are doing)
putting it in the language.

TBB was the library. Cilk Plus is the language equivalent.
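
To make the contrast concrete, here is a minimal sketch (assuming TBB's
lambda-based parallel_for overload; the Cilk Plus version needs a compiler
with Cilk Plus support):

      #include <tbb/parallel_for.h>

      // TBB: the parallel loop is an ordinary library call taking a lambda.
      void scale_tbb(float* a, int n, float s) {
          tbb::parallel_for(0, n, [=](int i) { a[i] *= s; });
      }

      // Cilk Plus: the same loop as a keyword the compiler understands
      // directly:
      //
      //     void scale_cilk(float* a, int n, float s) {
      //         cilk_for (int i = 0; i < n; ++i)
      //             a[i] *= s;
      //     }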

ArBB (Array Building Blocks) is the library. It needs to bake still from what
I see. It's worth a look - but it is new and rougher... despite the RapidMind
experience with product. These things take time. I'd suggest it lives as a
library for at least 5 years before we get too excited about trying to learn
from it and change a language.

Cilk's been at this since the mid-90s. TBB was our strongest way to say "we
believe in Cilk and we want the experience in product"... so we added 5 years
of that to Cilk's experience, and I think we are very ready to show the
results and argue we can be ready for standardizing.

~~~
scott_s
_TBB is the most widely used abstraction (not OS threads like pthreads) for
C++ parallelism (several developer surveys confirm this)._

I come from the HPC community, where this is not the case. OpenMP or just
plain ol' pthreads are dominant for shared-memory parallelism. That's partly
because in the HPC community, data parallelism dominates, not task
parallelism. Can you point to the study? I'd like to see who they surveyed,
what they asked and just what their general methodology was.

Also, from what I was able to find out about RapidMind, their product was a
combination of library and compiler support - but I was never able to get a
good grasp of what exactly they did. They were rather secretive. But my
understanding is that it's not accurate to describe ArBB as a "library." That
is what I've always found curious about making Cilk and RapidMind live
together in one product; as I understand it, they both require compiler
changes, and they do have some overlap in functionality.

Also also, welcome to Hacker News!

------
Maro
The OP mentions spawn/sync, so I looked it up; here's a snippet from Cilk's
Wikipedia page [1]:

    
    
      cilk int fib(int n)
      {
          if (n < 2) return n;
          else
          {
              int x, y;

              x = spawn fib(n - 1);
              y = spawn fib(n - 2);

              sync;

              return (x + y);
          }
      }
    

[1] <http://en.wikipedia.org/wiki/Cilk>
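
In Cilk Plus - the variant Intel is putting into GCC - the keywords are
spelled cilk_spawn and cilk_sync, so the same function would look roughly
like this (my sketch; needs a Cilk Plus-enabled compiler):

      #include <cilk/cilk.h>  // maps cilk_spawn/cilk_sync onto the keywords

      int fib(int n) {
          if (n < 2) return n;
          int x = cilk_spawn fib(n - 1);  // may run in parallel with the rest
          int y = fib(n - 2);             // continues in the current strand
          cilk_sync;                      // wait for the spawned call
          return x + y;
      }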

------
16s
The new C++ standard has std::thread (which came from boost::thread) and most
compilers already support it. So threads in C++ are no longer provided by a
library, right? Am I missing something? Is std::thread not good enough?

Edit: I meant to write, "provided by an external library".

~~~
scott_s
std::thread is a library - in fact, all of std:: are libraries. The new C++
standard has an actual memory model and a thread interface, but that's not
the same as parallelism being integrated into the language.

Hans Boehm's paper, "Threads Cannot Be Implemented As a Library"
(<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.2412&rep=rep1&type=pdf>),
brought this up a while back. The new memory model in C++
(<http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/>) helps solve the problems
he brought up in that paper: the standard now has a memory model, and (if I
understand the implications correctly) there is well-defined behavior when
using threads in C++.

But the author's point is deeper. Providing an interface to threads, and a
well-defined memory model, is a good step. But threads are a primitive for
implementing parallelism. They are a very thin abstraction. The author of this
article is arguing for richer abstractions that are integrated into the
language. If you want to write parallel programs, and you have to use threads,
mutexes and condition variables to do it, your code is going to be harder to
write, debug and reason about than if you had used higher level abstractions.
(And you will probably end up reimplementing some parts of those higher level
abstractions.) In this regard, C++ is behind. So, kudos to the Cilk team for
integrating their work into g++.
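
To make that concrete, here's roughly what a parallel sum costs in
boilerplate when written directly against C++11 threads (a sketch; a
higher-level construct would replace essentially all of it with one line):

      #include <numeric>
      #include <thread>
      #include <vector>

      // Parallel sum with raw std::thread: we hand-roll the partitioning,
      // the per-thread storage, and the join/combine steps ourselves.
      long parallel_sum(const std::vector<long>& v, unsigned nthreads) {
          std::vector<long> partial(nthreads, 0);
          std::vector<std::thread> workers;
          size_t chunk = v.size() / nthreads;
          for (unsigned t = 0; t < nthreads; ++t) {
              size_t begin = t * chunk;
              size_t end = (t == nthreads - 1) ? v.size() : begin + chunk;
              workers.emplace_back([&, t, begin, end] {
                  partial[t] = std::accumulate(v.begin() + begin,
                                               v.begin() + end, 0L);
              });
          }
          for (auto& w : workers) w.join();  // explicit synchronization
          return std::accumulate(partial.begin(), partial.end(), 0L);
      }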

~~~
16s
Yes, you are right, std::thread is a library. I meant to write "provided by
an external library". But I won't edit my original comment, as doing so would
take away from your response.

I see std::thread as a great first step. The richer abstractions will come
later, but I think discounting std::thread is unwise. It's a great advance for
C++.

~~~
scott_s
Libraries are the same no matter what namespace they live in [1]. The C++
compiler does not give special treatment to the libraries in std::. So
there's no difference between "external" libraries and libraries specified by
the standard.

[1] I suspect that most of the compilers will make the std::atomic_*
interface (<http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html>)
compile down to compiler intrinsics, but the interface will still be just a
template library.
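
As a concrete example of that footnote (a sketch; the lowering is what I'd
expect from a good compiler, not something the standard promises):

      #include <atomic>

      std::atomic<int> counter(0);

      void hit() {
          // At the source level this is just a template library call, but a
          // compiler can lower it to a single instruction (e.g. lock xadd on
          // x86) instead of an out-of-line function call.
          counter.fetch_add(1, std::memory_order_relaxed);
      }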

------
snorkel
For C I'd rather see parallel capabilities added to the stdlib rather than
messing with C's syntax. Syntax creep has rendered C++ practically
incomprehensible for mere mortals and it'd be a shame to see C get pushed down
the same path to madness.

~~~
scott_s
I don't think C has a well-defined memory model. So it still has the problems
Hans Boehm talked about in his "Threads Cannot Be Implemented As a Library"
paper:
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.90.2412&rep=rep1&type=pdf>

~~~
snorkel
Yes, C's memory model is fraught with concurrency race conditions, but it's
fast, direct, and dangerous, and that's part of its appeal. It'd be a shame
to see C tamed and muddled to look more like its caged derivatives.

~~~
scott_s
C does not have a memory model. Its "memory model" is whatever is provided by
the underlying hardware.

------
numeromancer
How does one keep serial semantics while making parallelism explicit? Those
two goals seem to be in conflict with each other.

~~~
scott_s
Out-of-order processors do it all the time:
<http://en.wikipedia.org/wiki/Out-of-order_execution> Most modern processors
execute instructions out of order. They have hardware logic specifically
designed to keep track of dependencies among the instructions, ensuring that
even though instructions are executed out of order, instructions are
_committed_ in order, and if instruction A depends on the result of B, A does
not execute until B finishes.

And therein lies your answer: many operations "don't care" what order they are
executed in. Consider the algebraic expression A + B + C + D + E. You can
"execute" that expression in any way, even doing some in "parallel" and still
end up with the same answer that you would if you did it the intuitive way,
from left to right. That's trivial. If you have A + B * C + D * E, then it's a
little more complicated because you have to ensure that B * C happens before
the additions that involve B and C, and same with D * E. But surely it's not
hard to imagine a static analysis that recognizes such dependencies, and a
runtime that enforces their order. So no matter what order the expressions
were actually executed in, you can pretend they happened in the serial order
you expected.

~~~
numeromancer
Thank you. What you say is correct, but not to the point. The expression you
gave is not _explicitly_ parallel. And more to the point, if I can expect
expressions to be executed in a serial order, then I don't see how I can make
them explicitly parallel.

~~~
scott_s
Think of it as an explicit _request_ for parallelism. So I may say something
like:

    
    
      parallel_sum(array);
    

And it would do a data-parallel summation of my array. In fact, OpenMP
(<http://openmp.org/wp/>) is a great example of what you're asking about.

    
    
      #pragma omp parallel for
      for (int i = 0; i < N; ++i) {
          dest[i] = a[i] + b[i] * c[i];
      }
    

That is explicitly requested parallelism with serial semantics. The serial
semantics are what you would expect without the omp pragma. But at runtime,
it will execute in parallel. The pragma is a promise that the loop iterations
are independent, and the runtime system farms the iterations out to threads
and then synchronizes them.
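
The hypothetical parallel_sum above can be written the same way; OpenMP's
reduction clause keeps the serial result for you (a sketch):

      // Each thread accumulates into a private copy of sum, and the runtime
      // combines the copies at the end, so the result matches the serial
      // loop (up to floating-point rounding, since the combination order
      // can differ).
      double parallel_sum(const double* a, int n) {
          double sum = 0.0;
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < n; ++i)
              sum += a[i];
          return sum;
      }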

~~~
numeromancer
This may be pedantic, but the #pragma seems more to be _overriding_ the serial
semantics, rather than complementing it.

BTW: what do you think of OpenMP? What are its conveniences and frustrations?
I may be looking into ways of doing embedded parallel development soon, and
OpenMP looks convenient, since it's implemented in gcc.

~~~
scott_s
No, _semantics_ are what something _means_. The results of executing the code
- its _semantics_ - are the same with and without the pragma. If that
distinction bothers you, consider that compilers perform all sorts of dirty
tricks when optimizations are turned on - things you didn't ask them to do,
but which they are allowed to do because they preserve semantics.

If you have data parallel code and you're working in C, C++ or Fortran,
OpenMP is an excellent solution for gaining performance from shared-memory
parallel machines. While I say above that it's a hack, it's a hack in terms
of language design; in practice, it works. It's a very well-defined standard,
and any problem that you have will have been encountered by thousands of
people before you, so you should be able to find solutions online easily.
It's super convenient for data parallel code like the above. It's frustrating
when you want to do task parallelism.

I'm unfamiliar with OpenMP in the embedded world, though. OpenMP in gcc relies
on Pthreads, and I don't know if that will be supported on your platform.

------
Maro
The existing C++ standard, ISO/IEC 14882, was published in 1998 and updated in
2003. The new standard has just been approved, and will hopefully come out in
2011 and will be called C++11. So, a new C++ standard takes about 10 years.

For the foreseeable future, we'll be stuck with libraries and
compiler-specific features for parallelism.

------
zeratul
Totally agree with: "Maintaining serial semantics is important. A program
should be able to be understood as a serial program."

So far I have used fork, MPI, and CUDA, but none of them makes code easy to
read.

------
Meai
Is there any reason not to use it? Basically every computer nowadays has
multiple cores.

------
baltcode
Does Cilk work for GPUs?

~~~
wmf
Not really; IIRC it assumes MIMD hardware. I think of Cilk as useful for
programs that can't be written in OpenCL.

~~~
scott_s
GPUs are data parallel machines. Cilk has support for both task and data
parallelism. The task parallel constructs in Cilk wouldn't make much sense on
GPUs.

~~~
baltcode
So data parallel constructs like parallel for loops can work with GPUs? I mean
not just in theory but can the current implementations program GPUs?

~~~
scott_s
Current implementations of Cilk, no, not as far as I know. There's also the
fact that you need to transfer the data to the GPU. The transfer costs may
kill any benefits from parallelism.

------
kennystone
ZeroMQ is pretty good at making programs parallel. Erlang NIFs can get the
job done too if you want a more exotic solution. The ship has sailed for
putting something directly in the language.

~~~
leif
This is a discussion about much lower-level parallelism.

