
Examples of Parallel Algorithms from C++17 - joebaf
https://www.bfilipek.com/2018/06/parstl-tests.html
======
glangdale
I like the approach for generality, but for something like the word count
example, there's absolutely no way that parallelizing at the multi-core level
would beat doing this in SIMD. None of those examples are topping out at more
than a gigabyte per second, which is walking pace for this sort of problem.

Further, I suspect that the extremely regular pattern of the space boundaries
is probably giving the branch predictor a leg up. It almost certainly would
pick up the 'every 5' pattern if not the 'every 17' pattern. So a realistic
example would be far more branch miss heavy.

This is still fine work, as the effort of writing a nice C++ function and
getting some parallel help on it is tiny compared to writing SIMD code, and
some problems don't play nice with SIMD at all.

~~~
pjmlp
Thing is, not very many developers actually understand how to break an
algorithm into SIMD, whereas making use of parallel algorithms is much more
accessible.

------
w8rbt
Looking forward to coroutines. I really want to see how they compare to Go's
implementation (which is great).

~~~
jstimpfle
FWIW, you can implement cooperative multithreading yourself, by using the
Fiber API on Windows, or with a few unstandardized function calls on
Linux and probably other Unixes.

It works beautifully (setting aside the fact that it's slightly "magical").
Written this way, it's much more comfortable to write Javascript-style
programs in C than in Javascript itself, where you need to chain callbacks :-)

~~~
infinite8s
You don't need to chain callbacks with await/async anymore.

------
snowAbstraction
I really wonder what the C++ community will think of execution policies in 5
years. Using them seems like an interesting step up: instead of coding your
own low-level parallel algorithms, you build on standard C++ algorithms with
an execution policy.

But are they an abstraction that will be useful in the future? Worth it for
the language?

~~~
ridiculous_fish
I think execution policies will prove too abstract and coarse in their current
state, so won't be very useful.

Normally it's good to have pleasant abstractions and generic functions. But
execution policies are there for speed and speed alone: if they don't make
things faster, they've failed. So in this case it's worth tolerating a more
awkward API to achieve better performance.

The C++17 parallelism interface more or less lets you declare "this loop can
be parallelized," which does not give the scheduler enough information. You
need to express properties of the algorithm, such as its optimal chunk size.
(In the future perhaps those properties will enter into the execution policy
as e.g. constructor parameters.)

Second, passing in callbacks which receive one or two elements at a time is
too granular. You are at the mercy of the compiler to optimize these. A better
approach would be for the callbacks to receive a pair of iterators and allow
the user to drive the loop. This makes it easier to avoid accidental
indirections, failed alias analysis or inlining, etc.

Intel's TBB will continue to be more useful here.

~~~
AtlasBarfed
It does seem that custom policies are needed when deadlockable resource pools,
throttling, or per-processor type of things need to be handled.

~~~
geezerjay
I don't believe these functions are supposed to be one-size-fits-all
solutions. I assume they are intended as the answer to "how come
std::transform() doesn't run this embarrassingly parallel problem in parallel?"

------
waynecochran
I'd like to compare parallel CPU and GPU versions with Thrust:
[http://thrust.github.io](http://thrust.github.io) When is a vector big enough
that it's worth processing on the GPU?

~~~
pavanky
It depends on the type of vector operation(s) you are doing and the machine
you are on. For one-off vector operations it is never worth making a transfer
to the GPU.

If you are going to have a lot of temporary vectors as part of a larger
algorithm, it is usually beneficial to copy the inputs once, do all the
computations on the GPU, and copy the results back.

~~~
cma
With the caveat that integrated GPUs with unified memory can skip the copy.

~~~
gpderetta
You would still need to transfer the data from the core's L1/L2 to the GPU
(same as for inter-core communication). While cheaper than a copy through the
PCI bus, it is not free.

------
jjnoakes
It's a shame the parallel recursive file-size sum example reads all of the
file and directory names into memory first and then iterates over them in
parallel afterwards. A clean, concise version of the same example, stream-like
but still parallelized, is not obvious to me at first glance (although I
don't touch C++ often; perhaps someone here knows of a good way?)

~~~
burntsushi
Indeed, I've actually written such a thing (although in Rust, not C++), but
I'm still looking for a clean, concise version. :-)

Here's the public API of mine[1], which is pretty blah, because it takes a
closure (per-thread initialization) that must return a closure (executed on
each entry in the tree). The most interesting part of the implementation is
here[2], along with a suspicious termination argument.

Outside of that, ucg (written in C++) also has a recursive directory
iterator.[3]

I also know of two written in Go, one in sift[4] and another in the platinum
searcher[5]. The ones in Go seem considerably simpler as they fall back to
goroutines.

I haven't carefully benchmarked all of these, but if memory serves, they're
all faster than the single threaded variants. In my experience, the hardest
part about implementing this is coming up with a solid termination argument
because your consumers are also your producers. (I documented this a bit in
`get_work` in my Rust code.)

[1] -
[https://docs.rs/ignore/0.4.2/ignore/struct.WalkParallel.html](https://docs.rs/ignore/0.4.2/ignore/struct.WalkParallel.html)

[2] -
[https://github.com/BurntSushi/ripgrep/blob/b38b101c77003fb94...](https://github.com/BurntSushi/ripgrep/blob/b38b101c77003fb94aaaa8084fcb93b6862586eb/ignore/src/walk.rs#L940-L1378)

[3] -
[https://github.com/gvansickle/ucg/blob/master/src/libext/Dir...](https://github.com/gvansickle/ucg/blob/master/src/libext/DirTree.cpp)

[4] -
[https://github.com/svent/sift/blob/2ca94717ef0bbf43068f217c8...](https://github.com/svent/sift/blob/2ca94717ef0bbf43068f217c8a0dfb216df52225/sift.go#L159-L376)

[5] -
[https://github.com/monochromegane/the_platinum_searcher/blob...](https://github.com/monochromegane/the_platinum_searcher/blob/master/walk.go)

~~~
jjnoakes
Yeah, I've written such a thing many times as well, in languages like Go which
make it easy, languages like C which make it not-as-easy, and a few in
between.

I was more curious whether there is a cleaner C++ way of doing it with all of
the recent or pending language changes.

------
saywatnow
Nice to see a fairly thorough set of examples, but my only conclusion is that
C++ really has become a parody of itself.

The language describing each policy is awful, but okay.

"so concise code" for 10 lines consisting mostly of ceremony to get the
results of one std function into a vector?

Does the first example with std::transform_reduce really need
`std::uintmax_t{ 0 }` twice instead of `0`?

2x-3x speed up for specifying a parallel execution policy on reasonably large
examples seems deeply disappointing. Author didn't specify how many cores the
examples were run on, but ouch.

Best of all is the volume of comments here predicting that these annotations
will eventually be deprecated ..

I really want to like C++.

------
MichaelMoser123
I wonder why they didn't make the execution policy a template parameter. If
they pass the object on the stack then they would have to switch on the type
at runtime; with a template parameter you can have a specialization on the
type without that overhead.

Also what is exactly the difference between parallel_execution_policy and
parallel_vector_execution_policy ?

~~~
fourthark
It is a template parameter, they just use a global object of a particular type
to select it, probably for easier defaulting.

E.g.
[https://en.cppreference.com/w/cpp/experimental/reduce](https://en.cppreference.com/w/cpp/experimental/reduce)

~~~
MichaelMoser123
Thanks, I didn't notice the declaration. This way they can deduce the type
without having to spell out the template parameter directly in the call, and
you don't have to construct an instance yourself; clever. You never stop
learning with C++:

    template<class T>
    void foo(T &&);

------
Someone
_”you’ll get the same results for accumulate and reduce for a vector of
integers (when doing a sum), but you might get a slight difference for a
vector of floats or doubles. That’s because floating point operations are not
associative.”_

That’s incorrect. Addition of _int_ isn’t associative in C++, either. For
example

    INT_MAX + (1 + -1) = INT_MAX

but

    (INT_MAX + 1) + -1

is undefined behavior, because signed integer overflow is undefined.

~~~
reikonomusha
Can you give an example of non-associative integer arithmetic where both
interpretations are defined but differ in value? This is the case with floats.

