
Basic GPU design concepts
https://iq.opengenus.org/basic-graphics-processing-unit-gpu-design-concepts/
======
Const-me
IMO too simplified.

> Each vector is transformed in screen space

It’s first transformed into clip space. This is important because some
primitives, or parts of them, are clipped out at this stage.
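
(To make the clip-space vs. screen-space distinction concrete, here is a minimal, API-agnostic sketch of what happens after the vertex shader outputs a clip-space position; the struct and function names are mine, not from any real pipeline:)

    #include <cstdio>

    struct Vec4 { float x, y, z, w; };

    // A vertex is inside the clip volume when -w <= x,y,z <= w (OpenGL-style z;
    // D3D uses 0 <= z <= w). Primitives entirely outside one plane are culled,
    // partially outside ones are clipped; all of this happens BEFORE the
    // divide by w and the viewport transform.
    bool insideClipVolume(Vec4 c) {
        return -c.w <= c.x && c.x <= c.w &&
               -c.w <= c.y && c.y <= c.w &&
               -c.w <= c.z && c.z <= c.w;
    }

    // Only after clipping: perspective divide (clip -> NDC), then viewport
    // transform (NDC -> pixel coordinates).
    void clipToScreen(Vec4 clip, float width, float height, float* sx, float* sy) {
        float ndcX = clip.x / clip.w;
        float ndcY = clip.y / clip.w;
        *sx = (ndcX * 0.5f + 0.5f) * width;
        *sy = (1.0f - (ndcY * 0.5f + 0.5f)) * height;  // flip Y for a top-left origin
    }

    int main() {
        Vec4 clip = { 1.0f, 0.5f, 0.2f, 2.0f };  // arbitrary example vertex
        float sx, sy;
        if (insideClipVolume(clip)) {
            clipToScreen(clip, 1920.0f, 1080.0f, &sx, &sy);
            printf("screen position: %.1f, %.1f\n", sx, sy);
        }
        return 0;
    }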

> Fragments are shaded to compute a color at each pixel.

Not all of them are shaded. If you’re programming at this level, it’s very
important to understand early Z rejection; otherwise you’ll waste too many
resources computing pixel shaders for invisible objects.
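
(A minimal sketch of one common way to take advantage of early-Z: a depth-only pre-pass followed by the shaded pass. It assumes an already-created OpenGL context; drawOpaqueGeometry() is a hypothetical stand-in for your draw submission:)

    #include <GL/gl.h>

    // Hypothetical helper: submit all opaque draw calls for the scene.
    void drawOpaqueGeometry() { /* issue glDrawElements / glDrawArrays here */ }

    void renderWithDepthPrepass() {
        // Pass 1: depth only. Fills the depth buffer with the nearest surface,
        // without running any expensive pixel shading.
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_LESS);
        glDepthMask(GL_TRUE);                                  // write depth
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   // no color writes
        drawOpaqueGeometry();

        // Pass 2: full shading. The depth buffer already holds the visible
        // surface, so hidden fragments are rejected by early-Z before the
        // pixel shader ever runs.
        glDepthFunc(GL_LEQUAL);                                // keep only the surviving surface
        glDepthMask(GL_FALSE);                                 // depth is already written
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        drawOpaqueGeometry();
    }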

Also, this article creates the impression that what you see on screen is made
out of shaded and textured triangles. I don’t think that’s the case, at least
not for modern games. They render dozens of passes per frame; each render pass
renders some stuff into textures, and the next pass reads from those textures
and writes somewhere else. See this article for a detailed explanation of one
specific game, GTA V: [http://www.adriancourreges.com/blog/2015/11/02/gta-v-graphic...](http://www.adriancourreges.com/blog/2015/11/02/gta-v-graphics-study/)
Other modern games usually do conceptually similar things. Not all of these
passes even render shaded triangles: compute shaders are now used, too, when
they fit better.

------
monocasa
There's also this, that goes a little more in depth.

[https://fgiesen.wordpress.com/2011/07/09/a-trip-through-
the-...](https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-
pipeline-2011-index/)

I'll also throw out there that since the Nvidia 8800, the basic layout is
still the same, but some of the fixed-function responsibilities are now being
handled by the programmable cores. Not enough that any stages have been
removed, but stuff like interpolation from the barycentric coordinates, and
part of the ROP workload, is handled by the pixel shaders (added by the
driver, not by your pixel shaders' code).
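
(For reference, the interpolation being talked about is just a weighted sum with the barycentric weights, made perspective-correct with the clip-space w values; a small self-contained sketch, not any vendor's actual driver code:)

    #include <cstdio>

    // Perspective-correct interpolation of one vertex attribute across a
    // triangle. a0..a2 are the attribute values at the three vertices,
    // w0..w2 their clip-space w, and b0..b2 the barycentric weights of the
    // pixel (b0 + b1 + b2 == 1). Work like this used to be fixed function and
    // can now be emitted into the pixel shader by the driver.
    float interpolate(float a0, float a1, float a2,
                      float w0, float w1, float w2,
                      float b0, float b1, float b2) {
        float num   = b0 * a0 / w0 + b1 * a1 / w1 + b2 * a2 / w2;
        float denom = b0 / w0      + b1 / w1      + b2 / w2;
        return num / denom;
    }

    int main() {
        // Example: interpolate a texture coordinate at the triangle's centroid.
        float u = interpolate(0.0f, 1.0f, 0.5f,            // u at each vertex
                              1.0f, 2.0f, 4.0f,            // clip-space w at each vertex
                              1.0f / 3, 1.0f / 3, 1.0f / 3);
        printf("interpolated u = %f\n", u);
        return 0;
    }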

~~~
je42
How Unreal renders a frame is also interesting:

[https://interplayoflight.wordpress.com/2017/10/25/how-
unreal...](https://interplayoflight.wordpress.com/2017/10/25/how-unreal-
renders-a-frame/)

------
jszymborski
Not sure why the link is to the discourse forum... the article is easier to
read at this link [0].

I don't know much about "GPU design concepts", but this doesn't seem like it
explains an awful lot. It appears to be a collection of largely unexplained
figures about rasterization and some vague unexplained diagrams about GPU
pipelines.

[0] [https://iq.opengenus.org/basic-graphics-processing-unit-
gpu-...](https://iq.opengenus.org/basic-graphics-processing-unit-gpu-design-
concepts/)

~~~
sctb
Thanks, we've updated the link from [https://discourse.opengenus.org/t/basic-
graphics-processing-...](https://discourse.opengenus.org/t/basic-graphics-
processing-unit-gpu-design-concepts/1353).

------
Eridrus
One of the things I didn't understand about GPUs until I tried to program one
is that they're not just chips with thousands of cores: they're chips with
thousands of cores that (roughly) share an instruction pointer, so branching
is very slow.
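
(A tiny CUDA sketch of what that means in practice; the kernel and data are made up, but it shows the mechanism: when threads of the same warp disagree on a branch, the warp executes both sides with some lanes masked off:)

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void divergent(const int* flags, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // The 32 threads of a warp share one instruction stream. If some take
        // the 'if' and others the 'else', the warp runs BOTH paths, masking
        // off the inactive lanes, so everyone pays for path A plus path B.
        if (flags[i]) {
            out[i] = 2.0f * i;          // stand-in for an expensive path A
        } else {
            out[i] = 0.5f * i * i;      // stand-in for an expensive path B
        }
    }

    int main() {
        const int n = 1 << 20;
        int* flags;
        float* out;
        cudaMallocManaged(&flags, n * sizeof(int));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) flags[i] = i % 2;   // worst case: lanes alternate branches
        divergent<<<(n + 255) / 256, 256>>>(flags, out, n);
        cudaDeviceSynchronize();
        printf("out[0] = %f, out[1] = %f\n", out[0], out[1]);
        cudaFree(flags);
        cudaFree(out);
        return 0;
    }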

~~~
montecarl
What's interesting, and I don't quite understand, is that even for problems
with a decent amount of branching, GPUs can still be surprisingly fast. I
wrote a path tracer as an OpenGL shader that had a lot of branching and didn't
use any special data structures for ray intersections, and it was still much
faster than running it on my CPU.

So in my example, for each pixel the GPU has to find out whether a ray
collides with any object in the scene, and then scatter that light off of the
object, up to some maximum number of scattering events or until the ray leaves
the scene. This results in a variable number of branches per pixel (between
about 1 and 16), but still gets good performance.

~~~
dahart
> What's interesting, and I don't quite understand, is that even for problems
> with a decent amount of branching, they can still be surprisingly fast.

If your threads are uniformly spread between 1 and 16 branches, then you're
probably always paying for 16 branches, and you could make it a lot faster by
grouping similar workloads or getting rid of branches.

But yes, branchy code can be fast as long as almost all threads do the same
thing. Branches don't automatically cost extra. What matters is how many
threads in your wavefront / warp / work group are executing the same branch.
If they all take the first branch in an if-else, and no threads in the warp
take the else clause, then you don't pay for execution of both branches. But
if one thread in the warp does take the else path, then all the threads in the
warp pay the cost of executing both paths.

The story is starting to change for the newest AMD & NVIDIA GPUs: they now
support per-thread instruction pointers and parallel divergent execution. But
there are big restrictions and caveats, and the high-level bit hasn't changed:
it is still best if all threads in a warp do the same thing.
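
(On the "getting rid of branches" option mentioned above: for cheap two-way cases the usual trick is to compute both sides and select, which costs the same for every lane; a toy example, not from the GP's path tracer:)

    #include <cstdio>

    // Branchless alternative to an if/else: evaluate both expressions and pick
    // one with a select. No divergence penalty, but only worth it when both
    // sides are cheap; otherwise grouping similar work is the better tool.
    float shadePoint(float x, bool hit) {
        float a = 2.0f * x;         // "path A"
        float b = 0.5f * x * x;     // "path B"
        return hit ? a : b;         // typically compiles to a select, not a branch
    }

    int main() {
        printf("%f %f\n", shadePoint(3.0f, true), shadePoint(3.0f, false));
        return 0;
    }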

~~~
westoncb
> ... and you could make it a lot faster by grouping similar workloads

Could anyone clarify what's meant by that, or more concretely how it might be
done?

~~~
dahart
One way people do this is to group pixels by which material shader needs to
execute. You want all threads in a warp to execute the same material shader,
if at all possible. One way to do this is to use a deferred shading
architecture, and do some kind of radix sort by shader id in between the
visibility and shading passes.

Using the GP's example of ray tracing with 1-16 branches, if you can figure
out in advance, or even just estimate, how many branches you're going to take,
you could sort, or even create 16 separate work queues. Assuming we're talking
about recursion that involves identical code for each branch, then by grouping
threads into chunks that are likely to execute only 1 branch, your entire warp
will (hopefully) execute 1 branch, and it will finish in 1/16th of the time
that it would take if any one of the threads went the full 16 branches.

If you're not doing graphics, the way people do this kind of stuff is to have
some kind of mapping function on the (virtual) thread id that lets them re-
arrange the order of events. You have complete control over what the thread id
means, so it doesn't have to point to a memory location nor handle your data
in a consecutive order. (Of course, you will lose cache coherence for out of
order memory access, but that might be small compared to divergence problems.)
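
(A concrete version of that remapping, sketched with CUDA + Thrust; the cost estimate and the toy kernel are assumptions, the point is only that after the sort, the 32 threads of a warp land on items with similar amounts of work:)

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <vector>

    // Each "item" needs between 1 and 16 bounces of work (as in the path-tracer
    // example above). Sorting item indices by that count groups similar
    // workloads, so whole warps run (nearly) the same number of iterations.
    __global__ void shade(const int* order, const int* bounces, float* out, int n) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n) return;
        int i = order[t];                    // remapped index: neighbours have similar cost
        float v = 0.0f;
        for (int k = 0; k < bounces[i]; ++k)
            v += 1.0f;                       // stand-in for one bounce of real work
        out[i] = v;
    }

    int main() {
        const int n = 1 << 20;
        std::vector<int> h_bounces(n);
        for (int i = 0; i < n; ++i) h_bounces[i] = 1 + (i * 7) % 16;  // fake cost estimate

        thrust::device_vector<int> bounces(h_bounces.begin(), h_bounces.end());
        thrust::device_vector<int> keys = bounces;            // sort key = estimated cost
        thrust::device_vector<int> order(n);
        thrust::sequence(order.begin(), order.end());          // order[i] = i
        thrust::sort_by_key(keys.begin(), keys.end(), order.begin());

        thrust::device_vector<float> out(n);
        shade<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(order.data()),
                                        thrust::raw_pointer_cast(bounces.data()),
                                        thrust::raw_pointer_cast(out.data()), n);
        cudaDeviceSynchronize();
        return 0;
    }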

------
frob
According to the top of the page, this was a "30 minute" read. It was two
pages of text and images with minimal explanation that terminate in the middle
of a section. It feels like something was only partially uploaded/copied.

------
tokyodude
This is probably not introduction level, but:

[http://aras-p.info/texts/files/2018Academy%20-%20GPU.pdf](http://aras-p.info/texts/files/2018Academy%20-%20GPU.pdf)

I think that was posted on HN recently?

~~~
blauditore
Wow, I just quickly scrolled through it and am no expert, but it looks super
extensive.

------
charlysl
You may find lecture 22 [1] of MIT's 6.823 "Computer System Architecture -
Spring 2017" [2] a good intro to GPU architecture.

[1]
[http://csg.csail.mit.edu/6.823/lectures/L22.pdf](http://csg.csail.mit.edu/6.823/lectures/L22.pdf)

[2]
[http://csg.csail.mit.edu/6.823/syllabus.html](http://csg.csail.mit.edu/6.823/syllabus.html)

------
person_of_color
Any chip startups trying to compete with NVIDIA for ML workloads?

~~~
why_only_15
Not a startup obviously, but Google made the TPU and open sourced it. It's a
pain in the ass to get stuff working sometimes, but other times it can be
pretty easy, e.g. with keras_to_tpu_model [0], which I used and found to be
more or less a magic bullet.

[0]:
[https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/ke...](https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/keras_to_tpu_model)

~~~
twtw
> made the TPU and open sourced it

In no sense is the TPU open source. Having an open source framework that is
able to use the hardware doesn't mean the hardware itself is open source; and
if it does mean that to you, then GPUs are open source too.

