AMD GCN Radeon Support in GCC 9 (gnu.org)
207 points by edelsohn 33 days ago | 40 comments

I’ve never written any code for the GPU. Does anyone with more experience have an idea how similar writing GCN code for gcc is to standard CPU C/C++?

I've written a few Reddit posts on the subject:



There's a 3rd story here: GCC's style is OpenMP 4.5. The big news is that the following kind of code is now possible:

    #pragma omp target map(a,b,c,d)
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        a[i] = b[i] * c + d;
    }
The above code will run on the GPU. GCC, Clang, and other compilers are working to support OpenMP 4.5's ability to offload code to GPUs.

That's what this GCC9 news is about: AMD GPUs are gaining the ability to execute the above code, integrated into OpenMP Tasks and all that good stuff.

Hasn't this worked via HSAIL for years?

HSAIL depends on an ecosystem of non-free software components. Granted, it's an uphill battle to get all the moving pieces needed for 3D/compute acceleration cards to work on a fully open-source stack.

That said, getting GCC to do a better job implementing code for those is a big step in that direction.

So the funny thing is that gcc's backend for GCN is actually much closer to writing for standard CPUs than any other way of programming GPUs, because you directly program it at the wave-level.

You see, the (not so) dirty secret of GCN is that from a very real perspective, the best way to think about it is that it's a bunch of CPU cores with extremely wide masked SIMD units.

Almost all GPU programming languages obscure that fact, and their compilers do magic behind the scenes to make it appear as if you were programming individual threads without SIMD.

GCC is the odd one out, which is really quite fascinating.

> You see, the (not so) dirty secret of GCN is that from a very real perspective, the best way to think about it is that it's a bunch of CPU cores with extremely wide masked SIMD units.

This is not quite doing it justice. GPU threads allow for branching (which thus includes things like early return). They do lose performance because inactive branches are sleeping (although I didn't test this on Volta and onward, which supposedly brings some improvements), but that's way better than breaking vectorization outright. This makes it much easier to port some naive (or just cache-optimized) piece of numerics code to the GPU than to try to vectorize it on the CPU.

How's that different from masking (i.e. per-lane predication)?

> How's that different from masking (i.e. per-lane predication)?

AMD GCN has hardware-accelerated masking plus stacks for this purpose. AMD's GPU is basically a combination of masking and fork/join commands to split up wavefronts.

But it's really just hardware-accelerated masks. You sometimes need to store the masks and program counters on a stack to allow for divergent threads, and then pop the masks off and execute the older commands... (complex if/then/else branches where every thread takes a different set of branches).
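To illustrate, here's a rough host-side sketch of that mask-push/mask-pop pattern. The 8-lane wavefront is hypothetical (real GCN wavefronts are 64 lanes wide and the EXEC mask is a 64-bit register); this just shows the mechanics, not the actual ISA:

```c
#include <stdint.h>

#define LANES 8

/* Emulate one wavefront executing
 *   if (x[i] < 0) y[i] = -x[i]; else y[i] = x[i];
 * the way GCN does: compute a lane mask for the condition, run the
 * then-side under it, invert the mask, run the else-side, reconverge. */
void wave_abs(const int *x, int *y) {
    uint8_t exec = 0xFF;               /* EXEC mask: all 8 lanes active   */
    uint8_t taken = 0;                 /* lanes where the condition holds */
    for (int i = 0; i < LANES; i++)
        if (x[i] < 0) taken |= (uint8_t)(1u << i);

    uint8_t save = exec;               /* "push" EXEC before diverging    */

    exec = save & taken;               /* then-branch lanes only          */
    for (int i = 0; i < LANES; i++)
        if (exec & (1u << i)) y[i] = -x[i];

    exec = save & (uint8_t)~taken;     /* else-branch lanes only          */
    for (int i = 0; i < LANES; i++)
        if (exec & (1u << i)) y[i] = x[i];

    exec = save;                       /* "pop": all lanes reconverge     */
    (void)exec;
}
```

Both sides of the branch are stepped through by the whole wavefront; the mask just decides which lanes commit results, which is why divergence costs you throughput but not correctness.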

The difference is that for masking you have to rely on the compiler to do-the-right-thing[TM], OR you have to program it yourself manually (which is quite annoying). With GPU threads the scheduling of branches is hardware-supported, so it's much easier to map your scalar code to it as long as it's sufficiently data-parallel.

That's the thing though: for AMD GCN, the compilers do have to do the right thing.

You can look at the actual hardware ISA, it's all documented e.g. here: https://rocm-documentation.readthedocs.io/en/latest/GCN_ISA_..., and both the HCC compiler and Mesa's OpenGL and Vulkan implementations are open-source and based on LLVM, where the backend is even upstream, so you can take a look for yourself.

Essentially everything those compilers do would be possible on a CPU with masked SIMD instructions (e.g. AVX512). In fact, that's basically how ISPC works. The one exception is that GCN has much more powerful scatter/gather support than CPUs do (and graphics-focused stuff like texture sampling hardware). But as far as the execution model is concerned, it's really the same.

I see, that's interesting. So essentially, AMD's hardware schedulers are not concerned with branching the same way Nvidia seems to be doing it?

Pascal does branching in a similar manner.

Volta / Turing does branching per-PC now, but it is opt-in. In any case, I expect Volta / Turing to prefer the old SIMD model of compute.

The only reason Volta / Turing supports per-PC branching is to handle some multi-thread communication issues. The execution units are still SIMD, so you take a heavy penalty going to the per-PC model.

PS: I haven't actually programmed Volta/Turing; it's just my understanding from the whitepaper.

Do the SIMD units have separate memory blocks, such that the super-wide SIMD, while controlled by one CPU core, would behave more like multiple threads with separate memory units? Or at least act like it from a latency perspective?

From a memory perspective you can view it as multiple threads. Caches are a bit more complicated (each "CPU core"/streaming processor has its own L1 cache, and cache coherency between L1 caches isn't necessarily guaranteed). Memory bandwidth and latency are still bottlenecks.

The bigger implication of the SIMD design is that for any conditional all your "threads" should take the same route, otherwise you may end up computing both paths for every thread.

I guess you're talking about scatter / gather? All vector memory instructions in the GCN ISA are scatter / gather-type instructions, i.e. you provide one pointer (or buffer index, or texture coordinates) per SIMD lane.

Docs say that the performance depends on the actual distribution of pointers, i.e. for best performance you should ensure that the pointers are consecutive (use SoA instead of AoS layouts, etc.), but I imagine that the drop in performance is somewhat gradual and graceful as your pointers become more scattered. I don't think the details are really documented though, and I wouldn't be surprised if those details changed between GCN generations.
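As a concrete illustration of the SoA-vs-AoS point (the struct and names are made up for the example): when each lane i loads field x of element i, an AoS layout produces strided per-lane addresses, while an SoA layout produces consecutive ones, which is what lets the gather coalesce into a few wide memory transactions:

```c
#define N 1024

/* AoS: lane i loads aos[i].x. The per-lane addresses are strided by
 * sizeof(struct p), so a wide gather touches roughly one cache line
 * per lane. */
struct p { float x, y, z; };

float sum_x_aos(const struct p *aos, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += aos[i].x;
    return s;
}

/* SoA: lane i loads xs[i]. The per-lane addresses are consecutive,
 * so the same gather coalesces into a handful of wide transactions. */
float sum_x_soa(const float *xs, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += xs[i];
    return s;
}
```

Both compute the same result; only the memory access pattern (and hence the achieved bandwidth on a GPU) differs.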

Sounds like it'd be a useful benchmark to do + publish for people's reference. May also lead to some tuning choices upstream too, depending on the results. :)

I've done a good bit of CUDA and OpenCL. I haven't dug into this GCC implementation, but my guess is that you're banned from using any features that require sophisticated runtime support (like exceptions or RTTI). Things like templates, which are a compile-time mechanism, should work fine.

Also, you still have the usual nasty issues of the GPU having a separate memory space from the CPU. This is just a code generation backend; not some transparent GPU system model for C/C++.

You'd usually write HLSL to target DXBC which is then compiled to GCN by the AMD driver. That's how it usually works in video games targeting DirectX 11 on PC.

So to answer your question, you wouldn't really use C/C++ to target GCN. I think this is more akin to how CUDA works by using a dialect on top of C.

CUDA supports most of C++14 and earlier: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

With a few CUDA-specific extensions.

In fact newer CUDA generations have been designed to run C++ code.

"CppCon 2017: Olivier Giroux 'Designing (New) C++ Hardware'" https://www.youtube.com/watch?v=86seb-iZCnI

They even have special libraries, like thrust, for writing generic GPU code. Is it safe to say that this development for gcc doesn’t provide something comparable for GCN?

> You'd usually write HLSL to target DXBC which is then compiled to GCN by the AMD driver.

Why not target SPIR-V for Vulkan driver?

Just a note: in the e-mail thread, it's only stated as a possibility that this patchset will be implemented in time for GCC 9. The only real news is that this patchset has been greenlit.

By the way; I'm confused trying to make sense of the effects of this patchset. In Andrew Stubbs' original e-mail, he states that the patchset which is discussed in this e-mail thread is for the "non-OpenACC/OpenMP portions" of the port. Not only that, but only C and Fortran are supported, C++ is explicitly unsupported, and everything else is untested.

So then, the definite effect of implementing this patchset is that the non-OpenACC/OpenMP part of the port will be operational for C and Fortran, plus whatever the effects are of the patches he mentioned briefly for the other areas (backend, config, and testsuite) without specifying details.

Given these definite effects, what are the noteworthy or most important effects of them? How much does this "power on" GCN?

> C++ is explicitly unsupported

Given that C++ now has a defined memory model, it might very well be that it's now impossible to implement C++ in a standard conforming way on exotic architectures.

> Given that C++ now has a defined memory model, it might very well be that it's now impossible to implement C++ in a standard conforming way on exotic architectures.

C has also had a defined memory model since C11:

> https://davmac.wordpress.com/2018/01/28/understanding-the-c-...

and it is the same as the memory model from C++11:

> https://en.wikipedia.org/w/index.php?title=Memory_model_(pro...

"After it was established that threads could not be implemented safely as a library without placing certain restrictions on the implementation and, in particular, that the C and C++ standards (C99 and C++03) lacked necessary restrictions, the C++ threading subcommittee set to work on suitable memory model; in 2005, they submitted C working document n1131 to get the C Committee on board with their efforts. The final revision of the proposed memory model, C++ n2429, was accepted into the C++ draft standard at the October 2007 meeting in Kona. The memory model was then included in the next C++ and C standards, C++11 and C11.".

C11 has pretty much the same memory model.

Curious how this stacks up against CUDA (from a programming perspective; I'd almost guarantee CUDA is faster). Does this provide a way to manage/copy memory back and forth from the GPU? Or does it just allow you to compile some code for GCN, with the rest up to you?

OpenMP 4.5 provides the #pragma omp target data, target enter data, target exit data, and target update constructs for managing memory to and from the GPU.

omp_target_alloc, omp_target_memcpy, and omp_target_associate_ptr provide some other utility functions you'd expect for these use cases. omp_target_memcpy can copy host-to-device, device-to-host, or even device-to-device.
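A minimal sketch of how the target data constructs fit together (the array size and function name are illustrative; when no offload device is configured, OpenMP falls back to running the region on the host):

```c
#define N 1000

/* Explicit host<->device data movement with OpenMP 4.5:
 * x is copied to the device on entry, y is copied both ways,
 * and the loop itself is offloaded with a target construct. */
void saxpy(float a, const float *x, float *y) {
    #pragma omp target data map(to: x[0:N]) map(tofrom: y[0:N])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }
}
```

The map clauses are what replace CUDA's explicit cudaMemcpy calls: data movement happens at the boundaries of the target data region.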

This appears to be a GCN backend intended for use with OpenMP and friends, so it's not really comparable to CUDA (more comparable, maybe, to the LLVM NVPTX backend).

Why is CUDA faster than OpenCL?

More work put into the implementation, better (more optimized) libraries, and some additional functionality that lets you do things faster in some circumstances. CUDA only has to care about Nvidia GPUs; OpenCL needs to run on anything.

It used to be on par on NV hardware. Then Nvidia just stopped improving their OpenCL backend.

Otherwise assuming no fancy features are used they are identical in their programming model.

AFAIK the programming models haven't been identical since Volta, because Volta and Turing now have forward-progress guarantees under intra-warp divergence, thanks to the presence of a unique program counter for each thread.

Before that, intra-warp divergence combined with badly placed synchronization operations could result in hard hangs.

I don't think this makes a difference in terms of raw low-level performance, but it might have an impact when implementing algorithms that require synchronization?

It's not faster from what I've heard. It just has more libraries around it.

It's a notable change that GPU instruction architectures are now stable enough for this kind of thing to happen.

Let's hope GPU programming becomes as accessible as CPU programming.

From the perspective of high performance computing (e.g. achieve a significant fraction of what the hardware is capable of for a given problem) I've found GPU programming always more accessible since the introduction of CUDA - assuming that the problem is something for which GPU makes sense in the first place (e.g. throughput is dominating over latency).

That is to say, I'd much rather optimize GPU kernels for things like sequential access, register use and programmable caches than make it vectorize on the various versions of AVX, treat multi-core parallelization separately and then fit everything into the fastest CPU cache possible just because the CPU's memory bandwidth is so damn slow.

CUDA is pretty great. But I ditched it in favor of OpenCL because it's closed source and limited to Nvidia GPUs.

GPU Programming is easier than AVX Programming on the CPU. That's why CUDA has become more popular.
