

Porting a renderer from C++ to CUDA - the speed gains and their cost - profquail
http://users.softlab.ece.ntua.gr/~ttsiod/cudarenderer.html

======
malkia
I work at a gamedev studio, and we briefly tried CUDA for DXT compression. The
speedup was real (the actual compression part was roughly 10x-20x faster).

But there were a couple of problems (solvable, but they might require changing
some habits):

    
       - Once you have CUDA running, you can't Remote Desktop to the machine. We are still on Vista (and we started trying this on XP); things might be better with Windows 7. 
         Alternative: VNC.
    
       - IT requires every unattended machine to be logged off after 15 minutes. This means that CUDA might stop working for you (no video driver). The same sometimes happens if you hit Ctrl+Alt+Del and bring up the Task Manager (that's inconsistent - it happens rarely).
    
       - The biggest problem was that there were 2-3 models of machines (Dell, then HP). We always use Intel with NVIDIA, but every now and then each machine would convert the texture data to DXT a little bit differently. Visually there was no problem, but it messed up the MD5 sums of the produced images, and locally people were getting different results than the ones stored on the cached asset server. 
         Alternative: keep MD5 sums of the source images plus the encoding arguments (though this would not detect cases where bad images were encoded).
    
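That last alternative amounts to keying the asset cache off the encoder's inputs rather than its output. A minimal sketch of the idea (names are made up, and FNV-1a stands in for MD5 just to keep it dependency-free):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a stands in for MD5 here to keep the sketch self-contained;
// the point is hashing the *inputs* to the encoder, not its output.
static uint64_t fnv1a(const void* data, size_t len,
                      uint64_t h = 1469598103934665603ULL) {
    const unsigned char* p = static_cast<const unsigned char*>(data);
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

// Cache key for a compressed texture: digest of the source pixels plus
// the exact encoder arguments. Two machines whose GPUs emit slightly
// different DXT bits from the same source still agree on this key.
uint64_t assetCacheKey(const std::vector<unsigned char>& sourcePixels,
                       const std::string& encoderArgs) {
    uint64_t h = fnv1a(sourcePixels.data(), sourcePixels.size());
    return fnv1a(encoderArgs.data(), encoderArgs.size(), h);
}
```

The trade-off is exactly the one noted above: a broken encoder produces a "valid" cache entry, since the output is never checked.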

We had one very smart guy working for 6 months on a new radiosity tool. He
started with a CUDA solution, but moved back to a multithreaded/multiprocess
solution with SSE code. For radiosity (precomputed lighting for the levels)
data we did not keep MD5 sums, so it did not matter.

So what worked sanely for us was OpenMP - speeding up the NVIDIA compressor by
processing strips with a height of 4 and the full width of the original image -
another smart graphics programmer came up with the idea.
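That strip decomposition works because DXT operates on independent 4x4 blocks, so a 4-pixel-high strip is a natural unit of parallel work. A rough sketch, with a dummy block encoder standing in for the real NVIDIA one:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-4x4-block DXT1 encoder; a real
// pipeline would call into the NVIDIA texture tools here.
static uint64_t encodeBlockDXT1(const uint8_t* rgba, int stride) {
    uint64_t acc = 0;
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4 * 4; ++x)   // 4 pixels * 4 bytes (RGBA8)
            acc = acc * 31 + rgba[y * stride + x];
    return acc;
}

// Each OpenMP iteration takes one strip: 4 pixels high, the full width
// of the image. Strips are independent, so no synchronization is needed.
std::vector<uint64_t> compressDXT1(const std::vector<uint8_t>& rgba,
                                   int width, int height) {
    const int stride = width * 4;         // bytes per row (RGBA8)
    std::vector<uint64_t> blocks((width / 4) * (height / 4));
    #pragma omp parallel for
    for (int strip = 0; strip < height / 4; ++strip) {
        const uint8_t* row = &rgba[strip * 4 * stride];
        for (int bx = 0; bx < width / 4; ++bx)
            blocks[strip * (width / 4) + bx] =
                encodeBlockDXT1(row + bx * 16, stride);
    }
    return blocks;
}
```

Without `-fopenmp` the pragma is simply ignored and the loop runs serially, which keeps the tool easy to debug.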

That being said - it's still exciting, but too hard for a lot of common tasks.

~~~
tmurray
The Remote Desktop issue is actually fixed as of CUDA 3.2 if you have a Tesla
card that is only doing compute (e.g., you're not running a Tesla C2050 as a
display card).

full disclosure: I work on the CUDA driver stack and the dedicated compute
driver for Vista/Win7 is my baby :)

~~~
malkia
Thank you! Thank you! We don't have Teslas, but I'm dreaming of buying one
(or whatever is the latest and greatest from NVIDIA).

------
mfukar
Or as it was cleverly put before the author: " _Primary rays cache; secondary
rays thrash_ "

Some interesting things to note, at least from my point of view:

* GPUs should not be considered like the next generation CPUs at this stage.

* Trying to take advantage of CUDA/OpenCL not only requires redesigning your algorithm and altering your data structures; it also tempts you into the folly of reimplementing in software what is already available in hardware.

* Thread shared memory on nVidia cards isn't exactly similar to a cache. There are also a lot of papers from the last two years that speak of nothing but altering algorithms to make better use of CUDA, because for anything but the simplest of raytracers on a few objects, the situation gets really bad.

Very nice post, though. Neatly organized and well written, I really enjoyed
it.

~~~
sharpneli
What stops you from using multiple passes to render the image? One pass for
the primary rays, then another pass for the secondary rays.

~~~
sparky
Nothing. In fact, everyone does this at least to first order; you need to do
bundles of primary rays at a time to take advantage of their spatial (and
thus, memory access pattern) coherence. The problem is that the secondary rays
are not a uniform grid like the primary rays, and they could be pointing any
which way, depending on scene geometry.

What you _can_ do is attempt to group up a bunch of secondary rays that appear
to be pointing roughly the same direction (e.g., if their primary rays all
reflected off the same flat, specular object), and do them in a batch,
exploiting their spatial coherence. Whether the process of finding coherent
secondary rays is less costly than just processing secondary rays in the same
order as their primary rays, again, depends on scene geometry.
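One cheap way to attempt that grouping is to sort secondary rays by a coarse direction key before tracing them. A minimal sketch (the six-bucket key is an illustration; real schemes quantize direction and origin much more finely):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Crude direction key: dominant axis and its sign (6 buckets).
static int directionBucket(const Ray& r) {
    float ax = std::fabs(r.dx), ay = std::fabs(r.dy), az = std::fabs(r.dz);
    if (ax >= ay && ax >= az) return r.dx >= 0 ? 0 : 1;
    if (ay >= az)             return r.dy >= 0 ? 2 : 3;
    return r.dz >= 0 ? 4 : 5;
}

// Sort secondary rays so that rays pointing roughly the same way are
// adjacent; tracing each bucket as a batch then traverses similar parts
// of the scene, which is where the coherence win comes from.
void groupSecondaryRays(std::vector<Ray>& rays) {
    std::stable_sort(rays.begin(), rays.end(),
                     [](const Ray& a, const Ray& b) {
                         return directionBucket(a) < directionBucket(b);
                     });
}
```

Whether the sort pays for itself is exactly the scene-dependent question above: incoherent bounces may scatter across all buckets anyway.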

------
malkia
Maybe you should take a look at this:

<http://home.comcast.net/~tom_forsyth/larrabee/larrabee.html>

"Rasterization on Larrabee and SIMD Programming With Larrabee: Michael Abrash
and I doing our double-act. We both talk about the instruction sets, Michael
talks about the hierarchical descent rasterisation algorithm, and I talk about
how we do basic language structures such as conditionals and flow control with
our 16-wide vector units. We were both absurdly proud to be able to finally
talk about the architecture we'd worked on for so long - it's not every
programmer that gets to design their own instruction set."

------
jakozaur
There are quite a few clever tricks to manage irregular problems. See this
poster:
<http://www.nvidia.com/content/GTC/posters/2010/A06-Task%20Management-for-Irregular-Workloads-on-the-GPU.pdf>

Also, a presentation on OptiX shows how to implement a ray tracer:
<http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2250>

Ray tracing can be done well on the GPU (see OptiX), but it is not trivial and
needs some tricks (persistent threads, work donation, etc.) which are not that
common on CPUs.

------
Geee
The BRIGADE real-time path tracer developed by Jacco Bikker uses hybrid
rendering utilizing both the CPU and the GPU as much as possible.
<http://www.youtube.com/watch?v=Jm6hz2-gxZ0>

------
pmjordan
It's not quite clear to me why _rasterising_ of all things is slow. I realise
GPUs have a separate rasterisation unit, but other than that, the ALUs are
_designed_ for this type of workload. I haven't experimented with latter-era
GPGPU APIs and languages, but random memory access in a basic rasteriser
sounds suspicious. Bouncing rays? Sure, that'll destroy any locality of
reference, but mapping triangles into screen space? No way.

~~~
sparky
Rasterization is slow when you have lots of small (often sub-pixel) triangles.
Mapping a big triangle into screen space is fast, but mapping many triangles
per pixel (and hopefully throwing out the 99+% of the geometry that won't show
up at all) into screen space and blending them together in a convincing way
takes a while.
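The cost asymmetry is easy to see in a textbook edge-function rasterizer: the per-triangle setup runs regardless of area, so when triangles shrink below a pixel the setup dominates and throughput collapses. A simplified sketch (coverage counting only, no shading or z-buffering):

```cpp
#include <algorithm>
#include <cmath>

// Count pixel centers covered by one screen-space triangle. The setup
// below (bounding box, edge coefficients) is paid per triangle whether
// it covers thousands of pixels or none, which is why many sub-pixel
// triangles are so much slower per pixel than one big triangle.
int rasterizeCoverage(float x0, float y0, float x1, float y1,
                      float x2, float y2) {
    int minX = (int)std::floor(std::min({x0, x1, x2}));
    int maxX = (int)std::ceil (std::max({x0, x1, x2}));
    int minY = (int)std::floor(std::min({y0, y1, y2}));
    int maxY = (int)std::ceil (std::max({y0, y1, y2}));
    int covered = 0;
    for (int y = minY; y < maxY; ++y)
        for (int x = minX; x < maxX; ++x) {
            float px = x + 0.5f, py = y + 0.5f;
            // Point-in-triangle test via the three edge functions;
            // accept either winding order.
            float e0 = (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0);
            float e1 = (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1);
            float e2 = (x0 - x2) * (py - y2) - (y0 - y2) * (px - x2);
            if ((e0 >= 0 && e1 >= 0 && e2 >= 0) ||
                (e0 <= 0 && e1 <= 0 && e2 <= 0))
                ++covered;
        }
    return covered;
}
```

A sub-pixel triangle usually covers zero pixel centers, yet still pays the full setup; hardware rasterizers amortize that cost in ways a software inner loop cannot.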

Real-time graphics does the best it can with a few big triangles (by e.g.
texturing and bump-mapping them) out of necessity, but people are moving
towards cases that are harder on the rasterizer.

------
AnthonBerg
I've worked with CUDA, and his negative conclusions are a little too dramatic
compared to how elegantly he used CUDA. (That said, the biggest cost of CUDA
development is the year or so it takes to let it all sink in. ALL of it!)

~~~
ttsiodras
Well, it was a weekend project, so I am not sure it is exactly... elegant. But
thank you, I accept the compliment :-)

As for my negative conclusions... I wouldn't describe them as "negative", just
"objective". Some algorithms are easily adapted to CUDA, and you get an easy
win of 10-40x. Others need a lot more work to offer speed gains, and finally,
there are some that are simply hopeless (you have to redesign them from
scratch).

