
The Landscape of Parallelism in C++ [video] - adamnemecek
https://www.youtube.com/watch?v=rrolR1BdTok
======
vardump
In high-level parallel programming, say for SIMD or a GPU (which is really
just a fancy name for a wide SIMD engine with a ton of hardware threads to
hide memory latency and a 2D-oriented gather/scatter memory controller), I
wish there were a way to expose the fact that your data is in reality shared
by multiple "cores" when those are actually just different lanes of the same
SIMD/warp/whatever unit. Right now you just pretend they're separate, even
though they can't even branch individually.

Example: think about an image processing algorithm, say a 3x3 kernel. With
16-wide SIMD and 3 registers, you only need special handling for the first
and last pixels; only they cross the processing boundary. The 14 pixels in
the middle already have all of their data. If you can plan your memory
accesses, you can do even better by shifting one pixel in from each new
16-pixel fetch.
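To make that concrete, here's a minimal sketch of what the hand-written
version can look like, under assumptions not in the original comment: 8-bit
pixels, SSSE3 intrinsics, and a made-up function name hsum3_row that only
does the horizontal 3-tap pass (a full 3x3 kernel would combine three such
rows vertically). The left and right neighbors come from shifting adjacent
16-byte loads together, so interior pixels cost no extra loads or clamps:

    #include <immintrin.h>  // SSSE3: compile with -mssse3 or higher
    #include <cstdint>
    #include <cstddef>

    // Sums each pixel with its left and right neighbors, 16 pixels per
    // iteration. Only the first and last 16-pixel blocks of the row need
    // separate boundary handling (not shown); assumes width >= 32.
    void hsum3_row(const uint8_t* src, uint16_t* dst, size_t width) {
        __m128i prev = _mm_loadu_si128((const __m128i*)src);
        __m128i cur  = _mm_loadu_si128((const __m128i*)(src + 16));
        const __m128i zero = _mm_setzero_si128();
        for (size_t x = 16; x + 32 <= width; x += 16) {
            __m128i next = _mm_loadu_si128((const __m128i*)(src + x + 16));
            // Shift one pixel in from the neighboring blocks instead of
            // re-loading: lane i becomes src[x+i-1] / src[x+i+1].
            __m128i left  = _mm_alignr_epi8(cur, prev, 15);
            __m128i right = _mm_alignr_epi8(next, cur, 1);
            // Widen to 16 bits and add the three taps, low then high lanes.
            __m128i lo = _mm_add_epi16(
                _mm_add_epi16(_mm_unpacklo_epi8(left, zero),
                              _mm_unpacklo_epi8(cur,  zero)),
                _mm_unpacklo_epi8(right, zero));
            __m128i hi = _mm_add_epi16(
                _mm_add_epi16(_mm_unpackhi_epi8(left, zero),
                              _mm_unpackhi_epi8(cur,  zero)),
                _mm_unpackhi_epi8(right, zero));
            _mm_storeu_si128((__m128i*)(dst + x),     lo);
            _mm_storeu_si128((__m128i*)(dst + x + 8), hi);
            prev = cur;
            cur  = next;
        }
    }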

Through the abstraction you have to gather the 9 values (center plus
neighbors for a 3x3 kernel) and run clamping for each neighbor access, even
when you're nowhere near the boundary of the image data. Even when a good
SIMD/GPU compiler notices the data is already loaded, you still end up with
needless saturation arithmetic (the clamps) and most likely extra memory
loads, potentially many times as many.
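For contrast, a hedged sketch (names made up, not taken from any particular
library) of the per-pixel pattern such a generic neighbor-access abstraction
typically lowers to; every one of the 9 gathers pays for a clamp, boundary
or not:

    #include <algorithm>  // std::clamp (C++17)
    #include <cstdint>
    #include <cstddef>

    // Every access goes through the boundary clamp, even for interior
    // pixels where it can never fire.
    static inline uint8_t at_clamped(const uint8_t* img, ptrdiff_t w,
                                     ptrdiff_t h, ptrdiff_t x, ptrdiff_t y) {
        x = std::clamp<ptrdiff_t>(x, 0, w - 1);
        y = std::clamp<ptrdiff_t>(y, 0, h - 1);
        return img[y * w + x];
    }

    uint16_t kernel3x3_sum(const uint8_t* img, ptrdiff_t w, ptrdiff_t h,
                           ptrdiff_t x, ptrdiff_t y) {
        uint16_t acc = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)  // 9 gathers, 9 clamps
                acc += at_clamped(img, w, h, x + dx, y + dy);
        return acc;
    }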

Sure, swizzling for locality of reference and a good cache controller help,
but the extra ALU work and memory accesses still mean suboptimal performance.

This is where I've gotten good gains (up to 2-5x) when writing SIMD assembly
or intrinsics by hand.

That's a lot of performance to leave on the table, given that CPU and GPU
performance hasn't improved much in the last 3 years.

This year GPUs will get HBM1/2 [1] plus more ALUs to use up that bandwidth.

Next year, AVX-512 for x86 CPUs. I guess HBM will come to CPUs too in a few
years.

My guess is that after that, the next significant improvements come maybe
2020-2025, if the current performance improvement trend holds.

The next step is probably CPUs absorbing GPU duties. So far memory bandwidth
has been the issue, but now you can stick HBM chips in a CPU package just as
well as in a GPU package. You just need some lower-clocked, ultra-wide cores.

[1]: https://en.wikipedia.org/wiki/High_Bandwidth_Memory

------
fredmorcos
<rant>

This talk was amazing; I was overwhelmed by the amount of work going into
C++. The problem is that it's already an abysmally complex language (to me)
and it doesn't look like anything is getting simpler. It's as if you had to
be there from the start to have a full picture of how the language grew and
truly understand the interplay between all the "features"; otherwise it's
too late to become an expert, and in some cases even too late to become
comfortable writing good, maintainable code in C++.

Unfortunately nothing comes close to its level of "practical usefulness"
across all dimensions (expressiveness, reasonable speed, ecosystem, etc.). I
personally just stick to C because things are simple to understand, and I
live with its shortcomings.

~~~
ericmo
"If all you have is a hammer, everything looks like a nail". In C you have to
build everything from scratch. If you need classes, hack some structs. If you
need string concatenation, write some mallocs and strcpy, and so on. C is
cool, but really, there's no way C is more "practical" than C++11, and C++14
will be even better.
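
As a small, hedged illustration of that point (the function names are made
up; both versions compile as C++), string concatenation the C way versus the
C++ way:

    #include <cstdlib>
    #include <cstring>
    #include <string>

    // The C way: hand-rolled allocation; easy to get the +1 or the free()
    // wrong, and the caller must remember to free() the result.
    char* concat_c(const char* a, const char* b) {
        char* out = (char*)std::malloc(std::strlen(a) + std::strlen(b) + 1);
        if (!out) return nullptr;
        std::strcpy(out, a);
        std::strcat(out, b);
        return out;
    }

    // The C++ way: length and ownership handled by std::string.
    std::string concat_cpp(const std::string& a, const std::string& b) {
        return a + b;
    }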

~~~
ThatGeoGuy
> and C++14 will be even better.

Perhaps you mean C++17? Almost the entirety of C++14 is supported in the
latest versions of all major compilers as of about the middle of last year.
Some pieces are still held back (particularly in GCC pre-5.0 and MSVC), but
they are tiny parts of the standard that don't much affect the majority of
the changes you'd make going from C++98 to C++11/C++14.

~~~
ericmo
You're right, C++14 is pretty much out there already. The problem is that
I'm still not using the latest compilers, lol.

