
AMD GCN Radeon Support in GCC 9 - edelsohn
https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00680.html
======
stochastic_monk
I’ve never written any code for a GPU. Does anyone with more experience have
an idea how similar writing GCN code with GCC is to standard CPU C/C++?

~~~
atq2119
So the funny thing is that gcc's backend for GCN is actually much closer to
writing for standard CPUs than any other way of programming GPUs, because you
directly program it at the wave-level.

You see, the (not so) dirty secret of GCN is that from a very real
perspective, the best way to think about it is that it's a bunch of CPU cores
with extremely wide masked SIMD units.

Almost all GPU programming languages obscure that fact, and their compilers do
magic behind the scenes to make it appear as if you were programming
individual threads without SIMD.

GCC is the odd one out, which is really quite fascinating.
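
To make that concrete: what a GPU language presents as 64 independent
"threads" is really one wave executing both arms of a branch under an
execution mask. A minimal C sketch of the model (just an illustration, not
real GCN code):

    #include <stdint.h>
    #include <stdio.h>

    #define WAVE_SIZE 64  /* GCN waves are 64 lanes wide */

    /* What a GPU language shows as 64 "threads" each running
     *     if (x < 0) y = -x; else y = x;
     * is one wave executing both branch arms under a mask: */
    static void wave_abs(const float *x, float *y)
    {
        uint64_t exec = ~0ULL;  /* all 64 lanes active */

        /* "if (x < 0)": compute a per-lane condition mask */
        uint64_t cond = 0;
        for (int lane = 0; lane < WAVE_SIZE; lane++)
            if (x[lane] < 0.0f)
                cond |= 1ULL << lane;

        /* then-arm: only lanes where the condition holds write */
        uint64_t mask = exec & cond;
        for (int lane = 0; lane < WAVE_SIZE; lane++)
            if (mask & (1ULL << lane))
                y[lane] = -x[lane];

        /* else-arm: the complementary set of lanes */
        mask = exec & ~cond;
        for (int lane = 0; lane < WAVE_SIZE; lane++)
            if (mask & (1ULL << lane))
                y[lane] = x[lane];
    }

    int main(void)
    {
        float x[WAVE_SIZE], y[WAVE_SIZE];
        for (int i = 0; i < WAVE_SIZE; i++)
            x[i] = (float)(i - 32);
        wave_abs(x, y);
        printf("y[0]=%g y[63]=%g\n", y[0], y[63]);  /* 32 and 31 */
        return 0;
    }

The GPU compilers do exactly this kind of mask bookkeeping behind the
scenes; GCC's GCN backend is unusual in exposing the wave directly.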

~~~
elcritch
Do the SIMD units have separate memory blocks that, while being controlled by
one CPU core, would make the super-wide SIMD behave more like multiple threads
with different memory units? Or at least act like that from a latency
perspective?

~~~
atq2119
I guess you're talking about scatter / gather? All vector memory instructions
in the GCN ISA are scatter / gather-type instructions, i.e. you provide one
pointer (or buffer index, or texture coordinates) per SIMD lane.

Docs say that the performance depends on the actual distribution of pointers,
i.e. for best performance you should ensure that the pointers are consecutive
(use SoA instead of AoS layouts, etc.), but I imagine that the drop in
performance is somewhat gradual and graceful as your pointers become more
scattered. I don't think the details are really documented though, and I
wouldn't be surprised if those details changed between GCN generations.
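
To spell out the AoS-vs-SoA point, a hypothetical C sketch (the `particle`
type is made up; the comments describe how a 64-lane gather would see each
layout):

    #include <stdio.h>

    #define N 1024

    /* AoS: lane i loads aos[i].x. The per-lane addresses are
     * sizeof(struct particle) = 16 bytes apart, so a gather
     * touches many more cache lines / memory transactions. */
    struct particle { float x, y, z, w; };
    static struct particle aos[N];

    /* SoA: lane i loads xs[i]. Per-lane addresses are consecutive
     * floats, which the hardware can coalesce into a few wide loads. */
    static struct soa { float xs[N], ys[N], zs[N], ws[N]; } soa;

    static float sum_x_aos(void) {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += aos[i].x;      /* stride-16 "gather" */
        return s;
    }

    static float sum_x_soa(void) {
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += soa.xs[i];     /* unit stride, coalesces */
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            aos[i].x = soa.xs[i] = 1.0f;
        printf("%g %g\n", sum_x_aos(), sum_x_soa());
        return 0;
    }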

~~~
justinclift
Sounds like it'd be a useful benchmark to do and publish for people's
reference. It might also lead to some tuning choices upstream, depending on
the results. :)

------
baddash
Just a note: in the e-mail thread, it's only stated as a possibility that this
patchset will be merged in time for GCC 9. The only real news is that this
patchset has been greenlit.

By the way, I'm confused trying to make sense of the effects of this patchset.
In Andrew Stubbs' original e-mail, he states that the patchset discussed in
this thread covers the "non-OpenACC/OpenMP portions" of the port. On top of
that, only C and Fortran are supported, C++ is explicitly unsupported, and
everything else is untested.

So then, the definite effects of merging this patchset are that the
non-OpenACC/OpenMP parts of the C and Fortran front ends will be operational,
plus whatever the briefly described patches for the other areas (back end,
config, and testsuite) contribute.

Given these definite effects, what are the noteworthy or most important
effects of them? How much does this "power on" GCN?

~~~
_pmf_
> C++ is explicitly unsupported

Given that C++ now has a defined memory model, it might very well be that it's
now impossible to implement C++ in a standard conforming way on exotic
architectures.

~~~
wolfgke
> Given that C++ now has a defined memory model, it might very well be that
> it's now impossible to implement C++ in a standard conforming way on exotic
> architectures.

C has also had a defined memory model since C11

> https://davmac.wordpress.com/2018/01/28/understanding-the-c-c-memory-model/

and it is the same as the memory model from C++11:

> https://en.wikipedia.org/w/index.php?title=Memory_model_(programming)&oldid=834038513

"After it was established that threads could not be implemented safely as a
library without placing certain restrictions on the implementation and, in
particular, that the C and C++ standards (C99 and C++03) lacked necessary
restrictions, the C++ threading subcommittee set to work on suitable memory
model; in 2005, they submitted C working document n1131 to get the C Committee
on board with their efforts. The final revision of the proposed memory model,
C++ n2429, was accepted into the C++ draft standard at the October 2007
meeting in Kona. The memory model was then included in the next C++ and C
standards, C++11 and C11."
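
Concretely, that shared C11/C++11 model is what makes cross-thread
publication like the following well-defined. A minimal C11 sketch
(release/acquire message passing; assumes a libc that ships C11 <threads.h>):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    int data;           /* plain, non-atomic payload */
    atomic_int ready;   /* publication flag */

    static int producer(void *arg)
    {
        (void)arg;
        data = 42;
        /* release: everything written before this store is
         * visible to whoever acquires the flag */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return 0;
    }

    static int consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;  /* spin until published */
        printf("%d\n", data);  /* prints 42, never garbage */
        return 0;
    }

    int main(void)
    {
        thrd_t p, c;
        thrd_create(&c, consumer, NULL);
        thrd_create(&p, producer, NULL);
        thrd_join(p, NULL);
        thrd_join(c, NULL);
        return 0;
    }

The release store pairs with the acquire load, so the consumer is guaranteed
to see data == 42 once it observes ready == 1; with plain loads and stores
that guarantee doesn't exist.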

------
01100011
Curious how this stacks up against CUDA (from a programming perspective; I'd
almost guarantee CUDA is faster). Does this provide a way to manage/copy
memory back and forth from the GPU? Or does it just let you compile some code
for GCN, with the rest up to you?

~~~
baybal2
Why is CUDA faster than OpenCL?

~~~
sharpneli
It used to be on par on Nvidia hardware. Then Nvidia just stopped improving
their OpenCL backend.

Otherwise assuming no fancy features are used they are identical in their
programming model.

~~~
TomVDB
AFAIK the programming models haven't been identical since Volta, because
Volta and Turing now provide forward-progress guarantees under intra-warp
divergence, thanks to each thread having its own program counter.

Before that, intra-warp divergence combined with badly placed synchronization
operations could result in hard hangs.

I don't think this makes a difference in terms of raw low-level performance,
but it might have an impact on implementing algorithms that require
synchronization?
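
To illustrate the kind of hang: here's a toy C simulation (a hypothetical
4-lane warp and a deliberately simplified pre-Volta reconvergence rule, not
real GPU semantics) of why a per-lane spinlock could deadlock when the whole
warp shares one program counter:

    #include <stdio.h>

    #define WARP 4          /* tiny warp for illustration */
    #define MAX_STEPS 1000  /* bound the simulation instead of hanging */

    /* Simulated lockstep execution of:
     *     while (atomicCAS(&lock, 0, 1) != 0) ;  // all lanes contend
     *     ...critical section...
     *     lock = 0;
     * One lane wins the CAS. With a single program counter, the warp
     * keeps executing the still-spinning side until all lanes
     * reconverge -- which never happens, because the winning lane
     * never gets to run its unlock. */
    int main(void)
    {
        int lock = 0;
        int acquired[WARP] = {0};

        for (int step = 0; step < MAX_STEPS; step++) {
            /* Lockstep CAS attempt by every still-spinning lane. */
            for (int lane = 0; lane < WARP; lane++) {
                if (!acquired[lane] && lock == 0) {
                    lock = 1;           /* this lane wins the lock */
                    acquired[lane] = 1;
                }
            }
            /* Simplified pre-Volta rule: the unlock only executes
             * once every lane has reconverged. */
            int all_acquired = 1;
            for (int lane = 0; lane < WARP; lane++)
                if (!acquired[lane])
                    all_acquired = 0;
            if (all_acquired) {
                puts("reconverged (never happens here)");
                return 0;
            }
        }
        puts("simulated hang: spinning lanes starve the lock holder");
        return 0;
    }

With per-thread program counters the scheduler can eventually run the lock
holder, so the release happens and the spinning lanes make progress.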

------
fulafel
It's a notable change that GPU instruction set architectures are now stable
enough for this kind of thing to happen.

------
bibyte
Let's hope GPU programming becomes as accessible as CPU programming.

~~~
m_mueller
From the perspective of high-performance computing (i.e. achieving a
significant fraction of what the hardware is capable of for a given problem),
I've found GPU programming more accessible ever since the introduction of
CUDA, assuming the problem is something for which a GPU makes sense in the
first place (e.g. throughput dominates over latency).

That is to say, I'd much rather optimize GPU kernels for things like
sequential access, register use, and programmable caches than make code
vectorize on the various versions of AVX, treat multi-core parallelization
separately, and then fit everything into the fastest CPU cache possible just
because the CPU's memory bandwidth is so damn slow.
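
For a flavor of what that looks like from plain C, here's a minimal SAXPY
sketch using OpenMP target offload (the kind of thing the GCC GCN work is
ultimately meant to enable; without an offload-capable toolchain it simply
runs on the host). Consecutive iterations touch consecutive elements, which
is exactly the sequential-access pattern GPUs reward:

    #include <stdio.h>
    #include <stdlib.h>

    /* SAXPY offloaded with OpenMP target directives. Adjacent GPU
     * lanes access adjacent elements of x and y, so the loads and
     * stores coalesce into wide memory transactions. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        float *x = malloc(N * sizeof *x);
        float *y = malloc(N * sizeof *y);
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(N, 3.0f, x, y);

        printf("y[0] = %g\n", y[0]);  /* 5 */
        free(x);
        free(y);
        return 0;
    }

Compile with e.g. gcc -fopenmp; the map() clauses are what handle the
copy-to/copy-from-device traffic the parent comment asks about.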

~~~
bibyte
CUDA is pretty great, but I ditched it in favor of OpenCL because CUDA is
closed source and limited to Nvidia GPUs.

