
How GPUs Work - luu
http://www.cs.virginia.edu/~gfx/papers/paper.php?paper_id=59
======
ak217
This overview, while a great start, doesn't really dive into the details of
how modern GPUs work. Since 2007, many of the limitations that held GPUs back
from being general-purpose computers have been removed (by the relentless
efforts of NVIDIA and, to a lesser extent, ATI/AMD, spurred in large part by NVIDIA's
traction in the supercomputing space, for example
[http://en.wikipedia.org/wiki/Titan_%28supercomputer%29](http://en.wikipedia.org/wiki/Titan_%28supercomputer%29)).
My go-to source for a lot of these developments is AnandTech
([http://www.anandtech.com/tag/cuda](http://www.anandtech.com/tag/cuda)) but
I'm sure there are plenty of other resources others can point to.

Another fascinating bit is that NVIDIA and ATI/AMD have developed what are now
the largest general-purpose processors in the world (over 5 billion
transistors per chip and counting - available in consumer GPUs for under $300,
as opposed to Intel's largest Xeons that top out at 4 billion and cost $2000+)
but are being held back at the 28nm process because their fab partner (TSMC)
is oversubscribed by smaller, higher-demand ARM chips that go into phones.

~~~
mej10
What do you mean by general-purpose here? Do you no longer have to use a
different programming model?

~~~
dantillberg
The parent is referring to how CUDA's introduction enabled developers to write
and compile C-ish code to run on a GPU, whereas previously programmers could
only take advantage of GPU power for non-rendering computations by hacking the
pixel shaders and such, bending the graphics hardware to do something it was
not designed for.

You still have to write your program in a very different way in order to run
efficiently on GPUs as opposed to CPUs.
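
To make that concrete, here's a rough sketch of what that "C-ish code" looks
like (a toy vector add; the names and sizes are made up). The kernel body
reads like ordinary C, but it's written per-element and launched across about
a million threads, which is exactly the "very different way" of structuring
the program:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each thread computes one element of c = a + b.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        // Managed memory keeps the sketch short; real code often uses explicit
        // cudaMalloc/cudaMemcpy and pays an explicit PCIe transfer cost.
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 4096 blocks of 256 threads
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);  // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }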

~~~
mej10
The comparison to a Xeon is what confused me. I didn't think such a
development had taken place.

------
yan
If anyone is even marginally interested in GPU internals, you'd do yourself a
favor by checking out John Owens' UC Davis class on the topic[1]. I once
watched the first lecture just to fill an hour and ended up going through the
entire course within the span of a week, following up with my own research
later on. Superbly interesting.

[1] [https://itunes.apple.com/us/itunes-u/graphics-architecture-w...](https://itunes.apple.com/us/itunes-u/graphics-architecture-winter/id404606990?mt=10)

or on youtube:
[https://www.youtube.com/playlist?list=PL4A8BA1C3B38CFCA0](https://www.youtube.com/playlist?list=PL4A8BA1C3B38CFCA0)

~~~
Joky
Thanks!!

See also [https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...](https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/)

------
slackito
Anyone interested in how GPUs work should read the series of blog posts by
Fabian Giesen "A trip through the graphics pipeline":
[https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...](https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/)

------
fat0wl
I'm sorry, I'll try to dig up the source myself, but I've read basically the
opposite argument in a few technical papers -- that the GPU is NOT as fast as
claimed for many classic test algorithms (the actual speed-up is more like a
factor of 2 than 10) and that the performance gap between traditional CPUs and
GPUs is actually narrowing.

I'm going to read this article anyway to hear their take & for the learning
experience, but does anyone remember any of the counter-arg articles?

~~~
tmurray
disclaimer: I work in this space and have done so for a while, including
previously on CUDA and on Titan.

GPUs for general purpose computation were never 100x faster than CPUs like
people claimed in 2008 or so. They're just not. That was basically NV
marketing mixed with a lot of people publishing some pretty bad early work on
GPUs.

Lots of early papers that fanned GPU hype followed the same basic form: "We
have this standard algorithm, we tested it on a single CPU core with minimal
optimizations and no SIMD (or maybe some terrible MATLAB code with zero
optimization), we tested a heavily optimized GPU version, and look the GPU
version is faster! By the way, we didn't port any of those optimizations back
to the CPU version or measure PCIe transfer time to/from the GPU." It was
utterly trivial to get any paper into a conference by porting anything to the
GPU and reporting a speedup. Most of the GPU related papers from this time
were awful. I remember one in particular that claimed a 1000x speedup by
timing just the amount of time it took for the kernel launch to the GPU
instead of the actual kernel runtime, and somehow nobody (either the authors
or the reviewers) realized that this was utterly impossible.
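
For what it's worth, that mistake is easy to make because kernel launches are
asynchronous: the launch call returns to the host immediately while the GPU is
still working. A minimal sketch of the wrong and right way to time a kernel
(the kernel and sizes here are made up):

    #include <cuda_runtime.h>
    #include <chrono>
    #include <cstdio>

    __global__ void work(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float x = d[i];
        for (int k = 0; k < 10000; ++k) x = x * 1.0000001f + 1e-6f;  // busy work
        d[i] = x;
    }

    int main() {
        const int n = 1 << 24;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        int blocks = (n + 255) / 256;
        using clk = std::chrono::steady_clock;

        // Wrong: this only measures how long it takes to *enqueue* the kernel.
        auto t0 = clk::now();
        work<<<blocks, 256>>>(d, n);
        auto t1 = clk::now();

        // Right: wait for the GPU to actually finish before stopping the clock.
        cudaDeviceSynchronize();
        auto t2 = clk::now();

        printf("launch only: %.3f ms, launch + execution: %.3f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count(),
               std::chrono::duration<double, std::milli>(t2 - t0).count());
        cudaFree(d);
        return 0;
    }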

GPUs have more FLOPs and more memory bandwidth in exchange for requiring PCIe
and lots of parallel work. If your algorithm needs those more than anything
else (like cache), can minimize PCIe transfer time, and handles the whole
massive parallelism thing well, then GPUs are a pretty good bet. If you can't,
then they're not going to work particularly well.

(Now, if you need to do 2D interpolation and can use the texture fetch
hardware on the GPU to do it instead of a bunch of arbitrary math... yeah,
that's a _huge_ performance increase, because you get that interpolation for
free from special-purpose hardware. But that's incredibly rare in practice.)
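
For the curious, a rough sketch of what that looks like with the runtime API
(sizes and names made up): put the image in a CUDA array, create a texture
object with filterMode set to linear, and tex2D() hands back a bilinearly
interpolated sample computed by the texture units rather than by arithmetic
in the kernel:

    #include <cuda_runtime.h>
    #include <vector>

    // Each thread reads one bilinearly interpolated sample; the filtering is
    // done by the texture hardware, not by explicit math in the kernel.
    __global__ void resample(cudaTextureObject_t tex, float *out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }

    int main() {
        const int W = 512, H = 512;                   // assumed image size
        std::vector<float> img(W * H, 1.0f);          // placeholder image data

        // The source image lives in a CUDA array so the texture units can sample it.
        cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
        cudaArray_t srcArray;
        cudaMallocArray(&srcArray, &fmt, W, H);
        cudaMemcpy2DToArray(srcArray, 0, 0, img.data(), W * sizeof(float),
                            W * sizeof(float), H, cudaMemcpyHostToDevice);

        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeArray;
        resDesc.res.array.array = srcArray;

        cudaTextureDesc texDesc = {};
        texDesc.addressMode[0] = cudaAddressModeClamp;
        texDesc.addressMode[1] = cudaAddressModeClamp;
        texDesc.filterMode = cudaFilterModeLinear;    // hardware bilinear filtering
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

        float *d_out;
        cudaMalloc(&d_out, W * H * sizeof(float));
        dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
        resample<<<grid, block>>>(tex, d_out, W, H);
        cudaDeviceSynchronize();

        cudaDestroyTextureObject(tex);
        cudaFreeArray(srcArray);
        cudaFree(d_out);
        return 0;
    }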

~~~
fat0wl
Ah, yes. :) A very nice, detailed summary of some of the issues in this sector
of "academia" (I put that in quotes only because all the research seems to be
co-written by corporations).

I'm into audio DSP and am planning to port a couple of audio algorithms (lots
of FFT & linear algebra) to run on the GPU, but I haven't even gotten to it
because I've considered it a premature optimization up to this point. I'm sure
it would improve performance, but nowhere near what GPU advocates would claim.

My biggest reason? "PCIe transfer time to/from GPU", plus it would be
unoptimized GPU code. Once you read a few of these papers it becomes painfully
obvious that a lot of tuning goes into the GPU algorithms that offer anything
more than a low single-digit factor of speedup. It's still very significant
(cutting a 3-hour algorithm down to 1 hour would be huge), but if you're in an
early stage of research it may be a toss-up whether it's better to just tune
the algorithm itself / run computations overnight rather than go through the
trouble of writing a GPU-based POC. Maybe if you have one or two under your
belt it's not such a big deal, but for most of the researchers I know, GPU
algorithm rewrites would not be trivial. (I've been doing enterprise Java
coding for about two years now, so the idea isn't so intimidating, but in a
past life of mucking around with MATLAB scripts I'm sure it would have been
daunting.)
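
FWIW, the cuFFT plumbing itself isn't bad; the thing worth measuring first is
exactly the transfer overhead mentioned above. A rough sketch (a single
1M-point complex FFT, sizes made up) that times the PCIe copies together with
the transform, so you see the cost the marketing numbers usually leave out:

    // build with: nvcc fft_timing.cu -lcufft
    #include <cuda_runtime.h>
    #include <cufft.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 1 << 20;                              // assumed FFT size
        std::vector<cufftComplex> h_sig(N, make_float2(1.0f, 0.0f));

        cufftComplex *d_sig;
        cudaMalloc(&d_sig, N * sizeof(cufftComplex));

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);                // plan once, off the clock

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // Include host->device and device->host copies in the measurement:
        // for a single FFT they often dominate the kernel time itself.
        cudaMemcpy(d_sig, h_sig.data(), N * sizeof(cufftComplex),
                   cudaMemcpyHostToDevice);
        cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD);
        cudaMemcpy(h_sig.data(), d_sig, N * sizeof(cufftComplex),
                   cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("FFT + PCIe transfers: %.3f ms\n", ms);

        cufftDestroy(plan);
        cudaFree(d_sig);
        return 0;
    }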

------
userbinator
One thing that's always put me off from studying GPUs in detail is the
proprietariness of everything; with few exceptions (Intel being one of them
recently, and surprisingly enough Broadcom for the RPi), there's no detailed
datasheet or low-level programming information publicly available for modern
GPUs, and what is available is still not all that complete. Contrast this with
CPUs where a lot of them have full, highly-detailed information on everything
from pinouts to how to get them to boot. People have made their own simple
computer systems by wiring up a CPU on a circuit board with some support
chips, but I don't think I've seen anything like this done for any reasonably
recent or even ancient GPU.

(I know there are VGA reimplementations available, and the VGA is quite well-
documented, but that's more of a timing controller/dumb frame-buffer than a
real GPU.)

~~~
dfox
A significant part of the functionality of a modern GPU is in software that
abstracts away the differences between models and generations; from this point
of view it does not make much sense to document the actual interface between
software and hardware. The other thing is that the complexity of this software
abstraction layer is comparable to the GPU itself, and manufacturers do not
expect that somebody would want to implement all of it from scratch (this is
similar to e.g. FPGAs, where even when you know the bitstream format, you
still have to write something non-trivial that generates the bitstream).

~~~
abecedarius
You could make the same arguments against documenting the machine code of a
CPU.

~~~
dfox
For a CPU, there is no other processor that can run all the abstraction
software, so it has to be done in hardware, or in software in a way that is
transparent to the user (microcode, Transmeta-style JIT...).

~~~
abecedarius
That's an implementation detail: the manufacturer supplies the system
software, and by this argument you're not supposed to care where it runs.

------
yazaddaruvala
Maybe off topic, but I'm actually really surprised that monitors and GPUs are
still different pieces of hardware.

I'll admit I only know the basics of GPU architecture, so please
forgive/correct me if I'm wrong about something. However, I am just too
curious not to share.

I'll try to explain. A frame buffer is nothing but a bunch of 1s and 0s in
memory, while a monitor is just a bunch of 1s and 0s in pixels. We currently
have the GPU write to memory in parallel, and we currently write pixels to a
monitor serially (and therefore interlacing). However, given the similarity
between memory and pixels, why can't we optimize a GPU to write to pixels in
parallel instead of to memory? Taken to the extreme, you could optimize your
GPU to have one shader per pixel, and since the shaders all run on the same
clock cycle, the whole monitor would update simultaneously. I think that would
be really cool and, more importantly, efficient. In more practical terms you
would probably have one GPU shader be responsible for some group of pixels (so
you only need one shader per 4x3 or 16x9 block of pixels).

So, before you say it, I get that you might disagree with me when it comes to
desktop GPUs, since 1. the GPU memory needs to be close to RAM (you don't want
the GPU memory to be on the other side of a "long" cable), and 2. you would
like to update the hardware for a GPU separately from your monitor. However,
in something like mobile or the Oculus, the form factor is already so small
and tightly coupled that I'm surprised optimizations like this aren't being
looked into.

Am I just not up to date? Is there something fundamentally wrong in my logic?
Does getting rid of the frame buffer/interlacing not provide enough of a boost
to make this worthwhile?

~~~
com2kid
A number of problems. A huge one is wiring. Parallel is really complicated
electrically: noise drowns out your signal and things run slowly. This is why
most of our buses have switched over to serial (e.g. USB, PCIe, etc.).
Sometimes we run those serial buses in parallel, but that still works out to
being easier.

Timing is another huge one. Imagine running 2 million wires (for a 1080p
display) that have to all be the exact same length to within some tolerance.

The longer those wires get, the harder this becomes. This is another huge
reason why the move to serial buses has happened. You can run 4 wires with
really tight timings and the bits will fly, but if you try to run 16 wires
together, speed ends up dropping dramatically. The reality is that circuit
boards don't have room for a large number of parallel traces that are all
exactly the same length!

RAM is a huge exception to this, but extreme measures have been taken to make
it possible: a good chunk of your motherboard is taken up getting the RAM
connected, and memory controllers moved onto the CPU in part to get the RAM
closer to the CPU and simplify traces.

Note this is all from the perspective of a software guy who has to listen to the
hardware team grumble for most of the day. :)

------
ryanseys
I'm currently taking an introductory course in computer graphics and we've
been taught most of the things covered in this article, including the theory
of the Phong lighting model and the graphics pipeline with different types of
shaders. This is more or less a 25,000-foot overview of how computer graphics
works and how the images on your screen came to be. It's highly interesting
stuff and knowing a small amount of the math behind how it works really gives
me an appreciation for the things I see in video games and 3D animations. I
wish this article had gone further to explain how the GPU actually produces
results in the highly-parallel way that this article seems to skim over.

------
pkaye
I wish there was a good book on GPU architecture and even micro-architecture.
I just like reading about this stuff and how they work.

~~~
oneofthose
There is an excellent slide deck by Kayvon Fatahalian [0] that I consider to
be the best high-level introduction into the topic (especially if you have a
deeper understanding of how a CPU works). But I agree, more detailed insights
would be great.

[0] [http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf](http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf)

~~~
tmurray
Kayvon teaches at Carnegie Mellon now and his class slides are definitely
worth reading:

[http://graphics.cs.cmu.edu/courses/15869/fall2014/](http://graphics.cs.cmu.edu/courses/15869/fall2014/)
[http://15418.courses.cs.cmu.edu/spring2014/](http://15418.courses.cs.cmu.edu/spring2014/)

------
mkagenius
Can't access the link, server overloaded?

"Description: Could not connect to the requested server host. "

Is there any other link for the paper?

~~~
teraflop
Try the direct PDF link:
[http://www.cs.virginia.edu/~gfx/papers/pdfs/59_HowThingsWork...](http://www.cs.virginia.edu/~gfx/papers/pdfs/59_HowThingsWork.pdf)

------
JabavuAdams
The title should be changed to reflect that this is from 2007. The graphics
that the article praises now look dated.

------
nemothekid
The 8800 GTX was the first GPU I ever bought, back in 2007 (obviously I'm not
very old). Now, 7 years later, it's funny how dated the render in "Figure 2"
is.

~~~
semi-extrinsic
When I started university I had a laptop with an 80 MHz discrete GPU with 8 MB
of memory ;) I believe it was an ATI Rage LT Pro.

