

Tim Sweeney on the twilight of the GPU - jparise
http://arstechnica.com/articles/paedia/gpu-sweeney-interview.ars

======
reitzensteinm
I think he's right, but since his company sells engines he's also got a vested
interest in the conversion to software rendering coming to pass.

It's a hell of a lot easier to build up a passable 3D engine using DirectX +
shaders than it will ever be to start from scratch and implement those same
algorithms in software. This will spread the field a lot - the best engine
developers will be able to do amazing stuff with the complete control over the
rendering pipeline that software 3D will provide, so the gap between the best
engines and average in-house ones will widen even further.

Which is a great thing if you develop and license out one of the top 3D
engines on the market.

~~~
eru
And the general programmer using it won't notice much of a difference in what
the libraries do under the covers.

~~~
reitzensteinm
Exactly. Intel will be providing a reference DirectX driver for Larrabee,
which I believe they're considering making open source, so for low end engines
it'll be business as usual.

------
preview
A return to software rendering could present an interesting opportunity for
Linux. A major barrier to adopting Linux is the lack of A-list games. If a
game no longer requires DirectX, then it would seem to be an easier task to
port it to many operating systems. It would not be a magic fix, but it would
be one more step toward the possibility of a mass-market Linux desktop.

~~~
jcl
_If a game no longer requires DirectX, then it would seem to be an easier task
to port it to many operating systems._

I don't think this is what is holding back Linux games. Many game companies
already design their games to not rely on DirectX -- none of the games on the
PS3, Wii, Mac, or various portables require DirectX.

But they will only port to platforms where they think they can make money.
Linux's low gamer population doesn't help, nor does the "everything's free"
user expectation.

------
abstractbill
_...the 3D hardware revolution sparked by 3dfx in 1997 will prove to only be a
10-year hiatus from the natural evolution of CPU-driven rendering._

<http://www.catb.org/jargon/html/W/wheel-of-reincarnation.html>

------
thalur
I think it will be interesting to see the variety of rendering paradigms that
come from this, but on the other hand, it seems like the transition will be a
bit of a nightmare - having to produce games with both a DirectX engine and a
software one to cope with two different sets of hardware. It would also be
nice if game developers put more effort into the gameplay than the graphics,
but I guess I'll be waiting a while longer for that...

As an aside, has anyone tried getting something like Erlang to run on a GPU? I
think it would be useful to be able to harness all the computing power of one
(or more) PCs with a common high-level language which is designed for that
purpose.

~~~
reitzensteinm
Edit: Oops, 20 minutes typing that out and I realised I've misread your
question - I read it as rendering with Erlang, not Erlang on GPU.

Actually, even with hand rolled C you'd have a hard time getting real time
rendering performance out of a network of computers. 10-100 white boxes have
fearsome throughput, but their latency will probably never be suitable for
real time use.

The first painful problem is the processing power of a GPU versus a CPU. A
general rule of thumb is that a GPU (or, more generally, a many-core stream
processor) will have more than 10x the peak FLOPS per transistor. The QX9775
versus the 8800 Ultra, according to Wikipedia, is 51 GFLOPS versus 576 - and
note that the 8800 Ultra cost significantly less when released!

Then there's the suitability of the two processor types for rendering.
Graphics cards have the benefit of specialised instructions (think SSE) to
help with things like antialiasing; I believe Larrabee will have quite a few
of these not found on x86. Then there's the cache model - stream processors
are designed for the data to stream through the processor (hence the name),
not for random access. This is exactly what you want for rendering, and I
think you'd probably find you'd get more cache misses on a modern Intel CPU
because you can't stream data from memory into the cache ahead of time. Let's
call it a 3x penalty between these two problems. I have no idea what it really
is.

So we're at requiring 30 QX9775s for each 8800 Ultra, which would probably be
30 Skulltrail boxes (2x QX9775 each) versus an 8800 Ultra SLI setup.
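
In rough numbers (a quick Python sketch of the arithmetic above - the 3x
penalty is the same wild guess):

    # Back-of-envelope, using the peak figures quoted above (peak FLOPS,
    # not sustained throughput; the rendering penalty is a guess).
    qx9775_gflops = 51.0      # Intel QX9775 peak, per Wikipedia
    ultra_gflops = 576.0      # NVIDIA 8800 Ultra peak, per Wikipedia

    raw_ratio = ultra_gflops / qx9775_gflops       # ~11.3x raw FLOPS gap
    rendering_penalty = 3.0                        # cache model, missing instructions
    cpus_per_gpu = raw_ratio * rendering_penalty   # ~34 - call it 30

    print(round(raw_ratio, 1), round(cpus_per_gpu))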

Now, rendering at 60 fps you've got to render in 16ms. This actually turns out
to not be much time in a networked environment.

First, you've got to push out all of the data that you would have pushed to
the graphics card over PCIe. A 32x link runs at 8 Gb/s, which is 8 times
faster than GigE, but you have to distribute the information to 30 computers.
The fastest way to do this is to stream the information to one other machine,
which in turn starts streaming to another, and so on - so you're transferring
at 500 Mbps (or 1 Gbps, I think, if in full duplex?). This is far less
efficient than just using PCIe, though! If you've got to send a lot of data,
you're probably screwed. 16ms at 500 Mbps is 1 megabyte - that is enough to
stream new rotations/positions for preloaded geometry, but it's not a lot of
triangles (considering the processing power of your setup), and it's nothing
in terms of textures. And that's if just the streaming-out portion eats up
your whole 16ms - you'll probably want to keep it down in the 1-2ms range,
which means a few hundred KB, at the theoretical throughput of the network!
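
The upstream budget, as a rough Python sketch (theoretical link rates, no
protocol overhead - real GigE will be worse):

    # 60 fps leaves ~16.7 ms per frame; at 500 Mbps that's about 1 MB,
    # and a 1-2 ms streaming budget leaves only a few hundred KB.
    frame_time_s = 1.0 / 60.0                    # ~16.7 ms per frame
    link_bps = 500e6                             # effective rate while daisy-chaining

    print(frame_time_s * link_bps / 8 / 1e6)     # ~1.04 MB for the whole frame time

    stream_budget_s = 0.002                      # only 2 ms spent streaming out
    print(stream_budget_s * link_bps / 8 / 1e3)  # ~125 KB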

Now, each of the machines renders its rectangle. Note that this doesn't scale
linearly. Imagine rendering just one pixel - you've still got to process all
of the geometry that can potentially modify that pixel. So each of your 30
machines is only rendering 1/30 of the frame, but the setup will end up far
less than 30x faster. Amdahl's law sometimes pops up in graphics, too!
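
A toy Python model of that effect (the split is made up - the shared geometry
work plays the role of the serial fraction in Amdahl's law):

    # Assume some fraction of each frame is geometry/setup work that every
    # box must do regardless of how small its screen rectangle is, and only
    # the pixel work divides across machines.
    def speedup(machines, shared_fraction):
        return 1.0 / (shared_fraction + (1.0 - shared_fraction) / machines)

    print(speedup(30, 0.2))   # ~4.4x with 20% shared work - nowhere near 30x
    print(speedup(30, 0.5))   # ~1.9x if half the frame is shared work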

Now, you've got to send the already-rendered frame buffer back to the machine
that's driving the display. You'd probably do this as a 32-bit frame (you want
at least one bit for the alpha, because you've got to have some way to mask,
and picking a colour for the mask means running through the whole image to
find a colour you've not used - there will always be one in 24-bit for
anything under 4096x4096).

Now, depending on how you're splitting the rendering up, this is going to add
up to 9 megabytes at 1920x1200 and... oops, I should have checked this before
I even started, because _that's too big to send back_. You can only send those
at 10fps on GigE uncompressed, and there's no way you'll be able to do
anything except RLE in 16ms - which, in general 3D rendering, won't do much.
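
The return path, in the same back-of-envelope style (uncompressed 32-bit
frames, theoretical GigE rate with no protocol overhead):

    width, height, bytes_per_pixel = 1920, 1200, 4
    frame_bytes = width * height * bytes_per_pixel
    print(frame_bytes / 1e6)                 # ~9.2 MB per frame

    gige_bytes_per_s = 1e9 / 8               # 125 MB/s theoretical
    print(gige_bytes_per_s / frame_bytes)    # ~13.6 fps best case, ~10 fps in practice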

Note that PCIe doesn't require this send - since the graphics card runs the
display - but even on an SLI setup, they use a special interconnect for
communication between the cards since PCIe itself isn't fast enough.

Wow, that was fun to write, but I doubt anyone is going to read it! Reader's
Digest version:

* 10x more FLOPS on a graphics card

* 3x the throughput per FLOP due to graphics specific optimisation (this is a wild guess)

* 30x the machine requirements, except that even 3D rendering does not scale linearly

* 60fps requires 16ms rendering, which allows for just 1 MB of network traffic at GigE (or 2 MB full duplex)

* You're sending back a 9 MB frame buffer rendering a 1920x1200 image

* Fail.

Now, the picture would probably change if you used 10 Gbps Ethernet and put 4x
Larrabee cards in each of a smaller number of boxes...

~~~
DaniFong
_Now, rendering at 60 fps you've got to render in 16ms. This actually turns
out to not be much time in a networked environment._

Not true. Human reaction times are 180 to 200 milliseconds for visual stimuli.
That's the limit on your pipeline. You still need to render once every 16 ms,
but it can take longer for each frame to get there than that.

~~~
reitzensteinm
Reaction times might be 180-200 milliseconds, but a delay in rendering (as in
a pipeline delay, so 60fps rendering but 180ms delay) will add to the reaction
time latency, it won't be masked by it. If you delay rendering by 180ms, then
the screen will change 180ms after you hit a key, and you'll only notice the
result after your reaction time for a total of 360ms.
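
In round numbers:

    # The pipeline delay and the player's reaction time add together;
    # the former is not hidden by the latter.
    pipeline_delay_ms = 180   # frames still arrive every 16 ms, just 180 ms late
    reaction_time_ms = 180    # typical visual reaction time
    print(pipeline_delay_ms + reaction_time_ms)   # 360 ms from key press to reaction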

I'd bet even 50ms will feel a bit sluggish subconsciously (people won't like
your controls but they won't know why), but delaying by 180-200ms will make
your character noticeably lag behind input.

I could easily whip something up to simulate it if you'd like.

~~~
DaniFong
That's true, but my main point was you don't really have time to react to
something being 'out of sync'. It might feel a little sluggish, but only just.
(I suspect)

~~~
reitzensteinm
Hmm, I think you're probably underestimating the impact. I've made a little
test program that you can play with yourself - if you've got a Windows machine
around at all:

<http://www.rocksolidgames.com/latencytest.exe>

Source is at:

<http://www.rocksolidgames.com/latencytest.bmx> (written in BlitzMax)

I can't notice 1 frame of latency (16ms per frame), but 2 doesn't feel quite
right, and by 4 frames (~64ms), the controls clearly feel lagged to me. Of
course, considering my day job I play a lot of games and troubleshoot problems
like this, so I'm probably more sensitive than average to something like this.
My gut feeling is that anything over 40-50ms is unacceptable. So you could
have a render pipeline three stages deep at 60fps, but there's no way I'd go
deeper than that. Also, if you've got any external latency in a game (e.g.
multiplayer net play), the budget is already blown and any kind of latency at
all is unacceptable.
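
For anyone without a Windows box handy, here's a rough Python/pygame
approximation of the same idea (not a port of the BlitzMax source above, just
a sketch): a square follows the arrow keys, with input held in a queue for
LAG_FRAMES frames before it's applied.

    import collections
    import pygame

    LAG_FRAMES = 4                  # frames of artificial input latency - try 0, 2, 4
    SPEED = 6                       # pixels moved per frame

    pygame.init()
    screen = pygame.display.set_mode((640, 480))
    clock = pygame.time.Clock()
    queue = collections.deque()     # buffered (dx, dy) input, one entry per frame
    x, y = 320, 240

    running = True
    while running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                running = False

        keys = pygame.key.get_pressed()
        dx = (keys[pygame.K_RIGHT] - keys[pygame.K_LEFT]) * SPEED
        dy = (keys[pygame.K_DOWN] - keys[pygame.K_UP]) * SPEED
        queue.append((dx, dy))      # record this frame's input

        if len(queue) > LAG_FRAMES: # apply the input from LAG_FRAMES frames ago
            old_dx, old_dy = queue.popleft()
            x += old_dx
            y += old_dy

        screen.fill((0, 0, 0))
        pygame.draw.rect(screen, (255, 255, 255), (x, y, 32, 32))
        pygame.display.flip()
        clock.tick(60)              # ~16 ms per frame

    pygame.quit()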

~~~
DaniFong
Wow, thanks for this demonstration. There are still a lot of surprises in
sensation, it seems.

~~~
reitzensteinm
You're welcome! It was educational for me too, I had never tested render
latency in isolation.

------
pmorici
This article is a bunch of bunk. Given the current speed advantage of the GPU
over the CPU, the more likely scenario is that the GPU will become more
generalized, e.g. NVIDIA CUDA.

The article is right that we are returning to an earlier paradigm, but it
isn't CPU rendering. It's having to know your hardware's capabilities and how
to program it for good performance.

