Examples of Parallel Algorithms from C++17 (bfilipek.com)
151 points by joebaf on June 26, 2018 | 36 comments



I like the approach for its generality, but for something like the word count example, there's absolutely no way that parallelizing at the multi-core level would beat doing this in SIMD. None of those examples tops out at more than a gigabyte per second, which is walking pace for this sort of problem.

Further, I suspect that the extremely regular pattern of the space boundaries is probably giving the branch predictor a leg up. It almost certainly would pick up the 'every 5' pattern if not the 'every 17' pattern. So a realistic example would be far more branch miss heavy.

This is still fine work, as the effort of writing a nice C++ function and getting some parallel help on it is tiny compared to writing SIMD code, and some problems don't play nice with SIMD at all.
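
For reference, the word-count example in the article boils down to roughly this shape (a sketch from memory, not the article's exact code):

    #include <cstddef>
    #include <execution>
    #include <functional>
    #include <numeric>
    #include <string>

    // Counts word beginnings: positions where a non-space follows a space.
    std::size_t count_words(const std::string& s) {
        if (s.empty()) return 0;
        std::size_t first = (s.front() != ' ') ? 1 : 0;
        return first + std::transform_reduce(
            std::execution::par,
            s.begin(), s.end() - 1,     // element i ...
            s.begin() + 1,              // ... paired with element i + 1
            std::size_t{0},
            std::plus<>{},
            [](char left, char right) -> std::size_t {
                return left == ' ' && right != ' ';
            });
    }

(A SIMD version of the same scan is essentially a byte compare, a shift and a popcount per vector of input, which is why a single core can go much faster than this.)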


Thing is, not very many developers actually understand how to break an algorithm into SIMD, whereas making use of parallel algorithms is much more accessible.


Looking forward to coroutines. I really want to see how they compare to Go's implementation (which is great).


FWIW, you can implement cooperative multithreading yourself, by using the Fiber API on Windows, or with a few unstandardized function calls on Linux/probably other unixes.

It works beautifully (putting aside, of course, the fact that it's slightly "magical"). Written this way, it's much more comfortable to write JavaScript-style programs in C than it is in JavaScript itself (where you need to chain callbacks) :-)
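
For the curious: on Linux those calls are presumably the <ucontext.h> family (not in ISO C or C++, but widely available on Unixes). A minimal sketch of bouncing between main and one "fiber":

    #include <ucontext.h>
    #include <cstdio>

    static ucontext_t main_ctx, fiber_ctx;
    static char fiber_stack[64 * 1024];

    static void fiber_fn() {
        std::puts("fiber: step 1");
        swapcontext(&fiber_ctx, &main_ctx);   // yield back to main
        std::puts("fiber: step 2");
        // falling off the end resumes uc_link, i.e. main_ctx
    }

    int main() {
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = fiber_stack;
        fiber_ctx.uc_stack.ss_size = sizeof fiber_stack;
        fiber_ctx.uc_link = &main_ctx;
        makecontext(&fiber_ctx, fiber_fn, 0);

        swapcontext(&main_ctx, &fiber_ctx);   // run fiber until it yields
        std::puts("main: between resumes");
        swapcontext(&main_ctx, &fiber_ctx);   // run fiber to completion
        std::puts("main: done");
    }

On Windows the equivalent is the CreateFiber/SwitchToFiber API.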


You don't need to chain callbacks with await/async anymore.


From the recent ISO meeting, it looks like the TS merge is uncertain for C++20.


Wow, I thought coroutines were pretty much a done deal... I guess all the big compilers implementing the TS wasn't a good indicator of progress.

Skimming the trip reports just now, it sounds like some Google engineers tried to use them, didn't like them, wrote a whole new proposal, and threw the schedule into disarray. Sigh. I'm sure they had good reasons to do this, but c'mon guys... at some point you have to decide that done is better than perfect.


The Google proposal does have quite a few things going for it, IMO.

The TS heap-allocates all coroutine frames by default and hopes that can be optimized out; the Google proposal treats the coroutine frame as an anonymously-typed object much like a lambda and leaves allocation (or not) up to the surrounding library.

The TS has a huge pile of extension points to control behavior; the Google proposal provides the same functionality via a single operator overload (which plays a role similar to the argument passed to call-with-current-continuation).

The TS has co_yield, co_return, and co_await; the Google proposal has a single operator that takes the place of co_await, while co_yield and co_return are made unnecessary by its interface.

The Google proposal does have some things that need to be solved, but (again IMO) it's a much more promising direction than the TS.
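
For anyone who hasn't looked at the TS: those extension points are mostly members of the promise_type you hang off your return type. A bare-bones generator in TS style looks roughly like this (using the std::experimental names from the TS; real implementations differed in detail):

    #include <exception>
    #include <experimental/coroutine>
    #include <iostream>

    struct generator {
        struct promise_type {                 // the TS's bundle of extension points
            int current = 0;
            generator get_return_object() {
                return generator{std::experimental::coroutine_handle<promise_type>::from_promise(*this)};
            }
            auto initial_suspend()        { return std::experimental::suspend_always{}; }
            auto final_suspend() noexcept { return std::experimental::suspend_always{}; }
            auto yield_value(int v)       { current = v; return std::experimental::suspend_always{}; }
            void return_void() {}
            void unhandled_exception()    { std::terminate(); }
        };

        explicit generator(std::experimental::coroutine_handle<promise_type> h) : handle(h) {}
        generator(generator&& g) noexcept : handle(g.handle) { g.handle = {}; }
        ~generator() { if (handle) handle.destroy(); }

        bool next()        { handle.resume(); return !handle.done(); }
        int  value() const { return handle.promise().current; }

        std::experimental::coroutine_handle<promise_type> handle;
    };

    generator iota(int n) {
        for (int i = 0; i < n; ++i)
            co_yield i;                       // sugar for co_await promise.yield_value(i)
    }

    int main() {
        auto g = iota(3);
        while (g.next())
            std::cout << g.value() << '\n';   // prints 0, 1, 2
    }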


Yes, these sound like good ideas... I'm not a big C++ language nerd and not in a good position to judge whether this is valuable or navel-gazing. But, as a bystander who only dips into C++ a few times a year, I see these antics and wonder if the language would be better off with a benevolent dictator who can actually get a feature out the door, even if it's less than perfect.

I remember watching Gor Nishanov's CppCon talk from 2015(!), learning a lot, seeing a working implementation, and getting really fired up about this feature. Now it's 2018 and it's back to the drawing board? What happened? Was Google the smart kid in high school who skips class and waits until the last second to do their homework? I know the committee has only the best intentions, but with stuff like this they're their own worst enemy.

https://www.youtube.com/watch?v=_fu0gx-xseY


> wonder if the language would be better off with a benevolent dictator who can actually get a feature out the door, even if it's less than perfect.

The history of standards is littered with such attempts, w/ and wo/ a dictator, and it never ends well. Reasons include that bad ideas/implementations become very hard to eradicate (they will have been used in legacy code); that some coding standards mandate use of standard functionality even if a better alternative exists (for good reasons such as portability, but still); and that such mistakes increase the activation energy to do things well later.

Another value of a standards organization is a drive for solid (not necessarily fanatic) orthogonality and comparability.

C++ has suffered from some legacy mistakes (e.g. << io operations) and hasty mistakes (std::auto_ptr) and appears pretty determined not to get it wrong again.

Better to let some ideas get worked out in detail with some use experience before sticking them in, especially for features that can be implemented in a library.


I agree that it might have been possible to get this done much earlier, but I don't agree with your premise that "done is better than perfect". Lots of people complain that while C++ packs a lot of features, it's ugly and everything is mismatched, or so it seems. I don't necessarily agree completely, but a number of things are just annoying (e.g. how do you overload pre- and postfix ++? With operator++() and operator++(int), where the dummy int doesn't do anything. Beautiful.)
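
For the uninitiated, that is:

    struct Counter {
        int n = 0;
        Counter& operator++()    { ++n; return *this; }                      // prefix:  ++c
        Counter  operator++(int) { Counter old = *this; ++n; return old; }   // postfix: c++ (the int is just a tag)
    };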

I believe that even at the cost of speed of progress, getting this kind of impactful feature right is worth the trouble, especially since there are already other possibilities for using coroutines in C right now, although they're not nicely fitted into the language.


That's fair. We're just speaking from two different perspectives: You (and the committee) want the best possible language, and you're willing to wait 5 years to get it. I want to see C++ evolve quickly so I can get some productivity and performance wins, in which case I'd probably start using the language for more projects. I gather there was enough mileage on Gor's proposal that it wasn't some vector<bool> kind of screwup, so I would have liked to have seen them call it good and ship it.

So average Joe Users like me have to wait and wait while it's made better and better. If you'd put this decision to a vote of 10,000 MSVC users instead of a bunch of compiler vendors and Google engineers, I bet coroutines would've been merged, warts and all. Instead, the years pass, attention drifts, modern languages beckon, and C++ becomes even more niche.


On the other hand, there are a couple of domains where not having an ISO certification hinders adoption, so with a benevolent dictator instead of a standards committee, C++ wouldn't be usable in such domains.


I really wonder what the C++ community will think of execution policies in 5 years. Using them seems to be an interesting step up from coding your own low level parallel algorithms: instead you use standard C++ algorithms with an execution policy to build your algorithm.

But are they an abstraction that will be useful in the future? Worth it for the language?
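
For anyone who hasn't tried them, the whole interface is an extra first argument on the existing algorithms, roughly:

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<double> v(50'000'000);
        // ... fill v ...
        std::sort(v.begin(), v.end());                        // sequential, as before
        std::sort(std::execution::par, v.begin(), v.end());   // same algorithm, parallel policy
    }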


I think execution policies will prove too abstract and coarse in their current state, so won't be very useful.

Normally it's good to have pleasant abstractions and generic functions. But execution policies are there for speed and speed alone: if they don't make things faster, they've failed. So in this case it's worth tolerating a more awkward API to achieve better performance.

The C++17 parallelism interface is more or less to declare "this loop can be parallelized," which does not provide enough information to the scheduler. You need to express properties of the algorithm, such as its optimal chunk size. (In the future perhaps those properties will enter into the execution policy as e.g. constructor parameters.)

Second, passing in callbacks which receive one or two elements at a time is too granular. You are at the mercy of the compiler to optimize these. A better approach would be for the callbacks to receive a pair of iterators and allow the user to drive the loop. This makes it easier to avoid accidental indirections, failed alias analysis or inlining, etc.

Intel's TBB will continue to be more useful here.
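
For comparison, TBB's parallel_for already works in roughly the "hand the callback a whole sub-range" style described above, with the grain size under the caller's control (sketch, values invented):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <cstddef>
    #include <vector>

    void scale(std::vector<float>& v, float k) {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, v.size(), /*grainsize=*/4096),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    v[i] *= k;   // the user drives a tight loop over a contiguous chunk
            });
    }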


It does seem that custom policies are needed when deadlockable resource pools, throttling, or per-processor type of things need to be handled.


I don't believe these functions are supposed to be one-size-fits-all solutions. I assume they are intended as the answer to "how come std::transform() doesn't run this embarrassingly parallel problem in parallel?"


It's what others have been doing as well. For instance, Java Streams were written from the start with a fork/join execution model in mind, and in Rust rayon::par_iter() is often a drop-in replacement of the regular sequential iter().


I'd like to compare parallel CPU and GPU versions with Thrust: http://thrust.github.io When is a vector big enough that it's worth processing on the GPU?


It depends on the type of vector operation(s) you are doing and the machine you are on. For one-off vector operations it is never worth it to make a transfer to the GPU.

If you are going to have a lot of temporary vectors as part of a larger algorithm, it is usually beneficial to copy the inputs once, do all the computations on the gpu, and copy them back.
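
With Thrust that pattern looks roughly like this (operation chosen arbitrarily for illustration):

    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/host_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/transform.h>

    float pipeline(const thrust::host_vector<float>& input) {
        thrust::device_vector<float> d = input;          // one host-to-device copy
        thrust::device_vector<float> tmp(d.size());      // temporaries live on the GPU
        thrust::transform(d.begin(), d.end(), tmp.begin(),
                          thrust::negate<float>());      // computed on the GPU
        return thrust::reduce(tmp.begin(), tmp.end(), 0.0f,
                              thrust::plus<float>());    // only a scalar comes back
    }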


With the caveat that integrated GPUs with unified memory can skip the copy.


You would still need to transfer the data from the core's L1/L2 to the GPU (same as for inter-core communication). While cheaper than a copy through the PCI bus, it is not free.


I'm really curious how you would do that? What kind of APIs can express this? Can they detect unified integrated GPUs?


It's a shame the parallel recursive file size sum example reads all of the file and directory names into memory first, and then iterates over them in parallel after. A clean concise version of the same example, in a stream-like but still parallelized way, is not obvious to me at first glance (although I don't touch C++ often, perhaps someone here knows of a good way?)


Indeed, I've actually written such a thing (although in Rust, not C++), and I'm still looking for a clean, concise version. :-)

Here's the public API of mine[1], which is pretty blah, because it takes a closure (per-thread initialization) that must return a closure (executed on each entry in the tree). The most interesting part of the implementation is here[2], along with a suspicious termination argument.

Outside of that, ucg (written in C++) also has a recursive directory iterator.[3]

I also know of two written in Go, one in sift[4] and another in the platinum searcher[5]. The ones in Go seem considerably simpler as they fall back to goroutines.

I haven't carefully benchmarked all of these, but if memory serves, they're all faster than the single threaded variants. In my experience, the hardest part about implementing this is coming up with a solid termination argument because your consumers are also your producers. (I documented this a bit in `get_work` in my Rust code.)

[1] - https://docs.rs/ignore/0.4.2/ignore/struct.WalkParallel.html

[2] - https://github.com/BurntSushi/ripgrep/blob/b38b101c77003fb94...

[3] - https://github.com/gvansickle/ucg/blob/master/src/libext/Dir...

[4] - https://github.com/svent/sift/blob/2ca94717ef0bbf43068f217c8...

[5] - https://github.com/monochromegane/the_platinum_searcher/blob...


Yeah, I've written such a thing many times as well, in languages like Go which make it easy and languages like C which make it not-as-easy and a few in between.

I was more curious whether there is a cleaner C++ way of doing it with all of the recent or pending language changes.
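
For what it's worth, one rough C++17 shape for it is a hand-rolled work queue over std::filesystem, where a count of outstanding directories doubles as the termination argument mentioned above (sketch only: error handling, symlink loops and thread-count tuning are glossed over, and all names are made up):

    #include <algorithm>
    #include <atomic>
    #include <condition_variable>
    #include <cstdint>
    #include <deque>
    #include <filesystem>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    namespace fs = std::filesystem;

    struct Walker {
        std::deque<fs::path> queue;           // directories still to visit
        std::mutex m;
        std::condition_variable cv;
        std::size_t outstanding = 0;          // dirs pushed but not yet fully processed
        std::atomic<std::uintmax_t> total{0};

        void push(fs::path p) {
            { std::lock_guard<std::mutex> lk(m); queue.push_back(std::move(p)); ++outstanding; }
            cv.notify_one();
        }

        void worker() {
            for (;;) {
                fs::path dir;
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [&] { return !queue.empty() || outstanding == 0; });
                    if (queue.empty()) return;          // outstanding == 0: walk finished
                    dir = std::move(queue.front());
                    queue.pop_front();
                }
                std::error_code ec;
                for (const auto& e : fs::directory_iterator(dir, ec)) {
                    std::error_code eec;
                    if (e.is_directory(eec)) {
                        push(e.path());                 // consumers are also producers
                    } else if (e.is_regular_file(eec)) {
                        if (auto sz = e.file_size(eec); !eec) total += sz;
                    }
                }
                std::lock_guard<std::mutex> lk(m);
                if (--outstanding == 0) cv.notify_all();  // wake idle workers so they can exit
            }
        }
    };

    int main(int argc, char** argv) {
        Walker w;
        w.push(argc > 1 ? argv[1] : ".");
        std::vector<std::thread> pool;
        unsigned n = std::max(1u, std::thread::hardware_concurrency());
        for (unsigned i = 0; i < n; ++i)
            pool.emplace_back([&] { w.worker(); });
        for (auto& t : pool) t.join();
        std::cout << w.total.load() << " bytes\n";
    }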


Anyway, why would anybody standardize this at all? Too many ways to do it and each one sucks because of race conditions. Any program that needs such a feature should just write the 10-20 lines version that it deems best.

There will hardly be any project that needs to know the "recursive file size" that is also too small to be able to afford this extra work.

Ok, but maybe I'm missing something here: Why parallel? Is there any benefit?


Yeah, it's kind of annoying that most parallel libs only show single-depth/degree examples. I get that's good for a "five minute intro", but usually parallel processing involves dynamic task production (descending the filesystem tree nodes) in addition to the parallel processing of the tasks themselves.

... in addition to the tasks behaving like streams.


That’s because the directory iterator isn’t random access, and I don’t see a way to get the functionality you want in any language, not just C++.


To do that, you’d have to buffer records from the file and dispatch threads to perform your task. Lazy loading can be preferable for large workloads, but it requires fine-grained threading. (And probably dedicated reading/writing threads and/or locks.)

That’s a lot more work than this, though.


Nice to see a fairly thorough set of examples, but my only conclusion is that C++ really has become a parody of itself.

The language describing each policy is awful, but okay.

"so concise code" for 10 lines consisting mostly of ceremony to get the results of one std function into a vector?

Does the first example with std::transform_reduce really need `std::uintmax_t{ 0 }` twice instead of `0`?

2x-3x speed up for specifying a parallel execution policy on reasonably large examples seems deeply disappointing. Author didn't specify how many cores the examples were run on, but ouch.

Best of all is the volume of comments here predicting that these annotations will eventually be deprecated ..

I really want to like C++.


I wonder why they didn't make the execution policy a template parameter - if they pass the object on the stack they would have to switch on the type at runtime, whereas with a template parameter you can specialize on the type without that overhead.

Also what is exactly the difference between parallel_execution_policy and parallel_vector_execution_policy ?


It is a template parameter; they just use a global object of a particular type to select it, probably for easier defaulting.

E.g. https://en.cppreference.com/w/cpp/experimental/reduce


Thanks, didn't notice the declaration - this way they can deduce the type without having to specify the template parameter directly in the call, and you don't have to construct an instance yourself; clever. You never stop learning with C++

      // The policy type is deduced from the argument, as with any forwarding reference:
      template<class T>
      void foo(T &&);
      // e.g. foo(std::execution::par) deduces T from the global policy object's type.


”you’ll get the same results for accumulate and reduce for a vector of integers (when doing a sum), but you might get a slight difference for a vector of floats or doubles. That’s because floating point operations are not associative.”

That’s incorrect. Addition of int isn’t associative in C++, either. For example

  INT_MAX + (1 + -1) = INT_MAX
but

  (INT_MAX + 1) + -1
is undefined.


Can you give an example of non-associative integer arithmetic where both interpretations are defined but differ in value? This is the case with floats.
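
(For contrast, a float case where both orders are defined but the results differ:)

    #include <cstdio>

    int main() {
        float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
        std::printf("%g\n", (a + b) + c);   // 1: a and b cancel exactly, then + 1
        std::printf("%g\n", a + (b + c));   // 0: c is absorbed into b, which then cancels a
    }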



