> Each vector is transformed in screen space
It’s first transformed into clip space. This is important because some primitives, or parts of them, are clipped out at this stage.
> Fragments are shaded to compute a color at each pixel.
Not all of them are shaded. If you’re programming at this level it’s very important to understand early Z rejection, otherwise you’ll waste a lot of resources computing pixel shaders for invisible objects.
Also, this article creates the impression that what you see on screen is made out of shaded + textured triangles. I don’t think that’s the case, at least not for modern games. They render dozens of passes per frame; each render pass renders some stuff into textures, and the next pass reads from those textures and writes somewhere else. See this article for a detailed explanation of one specific game, GTA5: http://www.adriancourreges.com/blog/2015/11/02/gta-v-graphic... Other modern games usually do something conceptually similar. Not all of these passes even render shaded triangles: compute shaders are now used, too, when they fit better.
I'll also throw out there that since the Nvidia 8800, the basic layout is still the same, but some of the fixed-function responsibilities are now being handled by the programmable cores. Not enough that any stages have been removed, but things like interpolation of the barycentric coordinates and part of the ROP workload are handled by the pixel shaders (added by the driver, not in your pixel shaders' code).
Not sure why the link is to the discourse forum... the article is easier to read at this link [0].
I don't know much about "GPU design concepts", but this doesn't seem like it explains an awful lot. It appears to be a collection of largely unexplained figures about rasterization and some vague unexplained diagrams about GPU pipelines.
One of the things I didn't understand about GPUs until I tried to program one is that they're not just chips with thousands of cores: they're chips with thousands of cores that (roughly) share an instruction pointer, so branching is very slow.
What's interesting, and I don't quite understand, is that even for problems with a decent amount of branching, they can still be surprisingly fast. I wrote a path tracer as an OpenGL shader that had a lot of branching and didn't use any special data structures for ray intersections, and it was still much faster than running it on my CPU.
So in my example, for each pixel the GPU has to find out whether a ray collides with any object in the scene, and then scatter that light off that object, up to some maximum number of scattering events or until the ray leaves the scene. This results in a variable number of branches per pixel (between about 1 and 16), but still gets good performance.
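To make that concrete, the inner loop looks roughly like the sketch below (CUDA-flavoured; Ray, Hit, Scene, intersect_scene and scatter are hypothetical stand-ins for whatever the shader actually does, not real APIs):

    // Sketch only: a per-pixel bounce loop with a data-dependent trip count.
    // Ray, Hit, Scene, intersect_scene() and scatter() are hypothetical placeholders.
    struct Ray { float3 origin, dir; };
    struct Hit { float emission, albedo; float3 point, normal; };
    struct Scene;                                            // opaque here
    __device__ bool intersect_scene(const Scene*, const Ray&, Hit*);
    __device__ Ray  scatter(const Hit&, unsigned int* rng);

    __device__ float trace(Ray ray, const Scene* scene, unsigned int* rng)
    {
        float radiance = 0.0f, throughput = 1.0f;            // grayscale for brevity
        for (int bounce = 0; bounce < 16; ++bounce) {
            Hit hit;
            if (!intersect_scene(scene, ray, &hit))          // ray left the scene
                break;                                       // lanes exit at different bounces
            radiance   += throughput * hit.emission;
            throughput *= hit.albedo;
            ray = scatter(hit, rng);                         // pick the next direction
        }
        return radiance;
    }

Since each lane exits the loop at a different iteration, the warp as a whole keeps stepping until its slowest lane is done, which is where the grouping tricks discussed below come in.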
> What's interesting, and I don't quite understand, is that even for problems with a decent amount of branching, they can still be surprisingly fast.
If your threads are uniformly spread between 1 and 16 branches, then you're probably always paying for 16 branches, and you could make it a lot faster by grouping similar workloads or getting rid of branches.
But yes, branchy code can be fast as long as almost all threads do the same thing. Branches don't automatically cost extra. What matters is how many threads in your wavefront / warp / work group are executing the same branch. If they all take the first branch in an if-else, and no threads in the warp take the else clause, then you don't pay for execution of both branches. But if one thread in the warp does take the else path, then all the threads in the warp pay the cost of executing both paths.
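A toy kernel shows where the cost comes from (a sketch; expensive_a and expensive_b are hypothetical stand-ins for two heavyweight code paths):

    // If every thread in a 32-wide warp takes the same side of the branch,
    // only that side is executed. If even one lane disagrees, the warp runs
    // expensive_a() AND expensive_b(), with non-participating lanes masked off.
    __global__ void branchy(const int* flags, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (flags[i])
            out[i] = expensive_a(i);   // hypothetical heavy path A
        else
            out[i] = expensive_b(i);   // hypothetical heavy path B
    }

Whether this runs at full speed or roughly half speed depends entirely on whether flags[] is coherent within each group of 32 consecutive threads.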
The story is starting to change with the newest AMD & NVIDIA GPUs: they now support per-thread instruction pointers and parallel divergent execution. But there are big restrictions and caveats, and the high-level point hasn't changed: it is still best if all threads in a warp do the same thing.
One way people do this is to group pixels by which material shader needs to execute. You want all threads in a warp to execute the same material shader, if at all possible. One way to do this is to use a deferred shading architecture, and do some kind of radix sort by shader id in between the visibility and shading passes.
Using the GP's example of ray tracing with 1-16 branches, if you can figure out in advance, or even just estimate, how many branches you're going to take, you could sort, or even create 16 separate work queues. Assuming we're talking about recursion that involves identical code for each branch, then by grouping threads into chunks that are likely to execute only 1 branch, your entire warp will (hopefully) execute 1 branch, and it will finish in 1/16th of the time that it would take if any one of the threads went the full 16 branches.
If you're not doing graphics, the way people do this kind of stuff is to have some kind of mapping function on the (virtual) thread id that lets them re-arrange the order of events. You have complete control over what the thread id means, so it doesn't have to point to a memory location nor handle your data in a consecutive order. (Of course, you will lose cache coherence for out of order memory access, but that might be small compared to divergence problems.)
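As a sketch of that remapping idea, here is one way to build such an order with Thrust (which ships with CUDA); the key array is assumed to be filled by an earlier pass, and could hold material/shader ids or estimated bounce counts:

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>

    // Sketch: build a permutation of work-item indices so that items sharing a
    // key (shader id, estimated bounce count, ...) land in the same warps.
    void build_coherent_order(thrust::device_vector<int>& keys,   // filled earlier
                              thrust::device_vector<int>& order)  // output permutation
    {
        order.resize(keys.size());
        thrust::sequence(order.begin(), order.end());             // 0, 1, 2, ...
        thrust::sort_by_key(keys.begin(), keys.end(),             // sort indices by key
                            order.begin());
    }
    // A later kernel reads order[global_thread_id] to decide which item to work
    // on, so each warp sees (mostly) one key value.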
The GPU is actually divided into groups of execution units. Vendors use terms like wave or warp to describe such a group. One such execution unit contains on the order of 32-64 ALUs. When a branch is hit, if all of them do not go into the same branch, I believe they still all execute both branches. Each ALU keeps a flag telling it which branch is live, and the other branch basically executes as a no-op. So if a warp has a mix of branch taken and not taken, your execution time is proportional to the total cost of both branches added together. It's still faster than a CPU due to the huge number of ALUs. (Plus, image coherence means you will often hit only one branch across all ALUs, which is optimized.)
To a first approximation, as long as spatially-local fragments all take the same branches, you can get good performance from branching on GPU. What matters is whether individual threadgroups ("warps", in NVIDIA speak) have coherent branches. The rasterizer makes an attempt to schedule nearby pixels on the same threadgroup, mostly to improve memory locality, but also to benefit workloads like yours.
In your case, if your scene objects are relatively large and nearby rays all tend to hit the same objects, then per-threadgroup branch divergence can be minimal.
According to the top of the page, this was a "30 minute" read. It was 2 pages of text/images with minimal explanation which terminates in the middle of a section. It feels like something was only partially uploaded/copied.
Not a startup obviously, but Google made the TPU and open sourced it. It's a pain in the ass to get stuff working sometimes, but other times it can be pretty easy, e.g. with keras_to_tpu_model[0], which I used and found to be more or less a magic bullet.
In no sense is the TPU open source. Having an open source framework that is able to use the hardware doesn't mean the hardware is open source; or if it does mean that to you, GPUs are open source too.
I am hoping AMD Radeon Vega 7nm along with ROCm 2.0 will at least bring some much-needed competition to Nvidia. But the dev head of ROCm recently followed Raja, the previous head of the Radeon Group, to Intel, and likely more will follow (he has already poached many; at this rate the Radeon Group will be empty by the end of next year). So I guess maybe Intel will be the competitor instead?