
Revisiting the Business Card Raytracer - vanni
https://fabiensanglard.net/revisiting_the_businesscard_raytracer/index.html
======
bigcheesegs
> Opening the binary with Binary Ninja revealed that clang had already managed
> to leverage the SSE registers.

X86-64 uses SSE registers for all floating point operations. I'm not sure that
the author realized that they were looking at an -O0 binary. -O0 does not do
vectorization (or anything else for that matter).

~~~
slavik81
Looking at it on Godbolt, it doesn't really leverage SSE on -O3, either. You
can get a reasonable grasp of whether it's using SSE effectively or not just
by looking at the instruction names.

mulss: multiplication of a single single-precision floating point value.

mulsd: multiplication of a single double-precision floating point value.

mulps: multiplication of a packed group of single-precision floating point
values.

mulpd: multiplication of a packed group of double-precision floating point
values.

If you're mostly seeing -ps suffixes only on moves and shuffles, you're
looking at code that is not being vectorized. (And if you're seeing a lot of
shuffles, that's also a good sign it's not well-vectorized.)

Incidentally, if you're seeing unexpected -sd suffixes, those are often due to
unintended conversions between float and double. They can have a noticeable
effect on performance, especially if you end up calling the double versions of
math functions (as they often use iterative algorithms that need more
iterations to achieve double-precision).

I'm linking GCC output, because it's simpler to follow, but you see more or
less the same struggle with Clang.

[https://godbolt.org/z/XtVqsU](https://godbolt.org/z/XtVqsU)

~~~
K0nserv
This post is really topical for me. I spent hours yesterday trying to write
explicit SIMD code[0] for the vector dot product in my raytracer, and all I
managed to do was slow the code down by about 20-30%.

The code generated by Rust from the naive solution mostly uses ss
instructions, whereas my two tries using `mm_dp_ps`, and `mm_mul_ps` with
`mm_hadd_ps`, were both significantly slower even though they result in fewer
instructions. I suspect that the issue is that for a single dot product the
overhead of loading in and out of __m128 registers costs more than it's worth.

Naive Rust version output

    
    
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq  %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vmovss  (%rdi), %xmm0
        vmulss  (%rsi), %xmm0, %xmm0
        vmovsd  4(%rdi), %xmm1
        vmovsd  4(%rsi), %xmm2
        vmulps  %xmm2, %xmm1, %xmm1
        vaddss  %xmm1, %xmm0, %xmm0
        vmovshdup %xmm1, %xmm1
        vaddss  %xmm1, %xmm0, %xmm0
        popq  %rbp
        retq
    

My handwritten version with `mm_mul_ps` and `mm_hadd_ps`

    
    
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq  %rsp, %rbp
        .cfi_def_cfa_register %rbp
        vmovaps (%rdi), %xmm0
        vmulps  (%rsi), %xmm0, %xmm0
        vhaddps %xmm0, %xmm0, %xmm0
        vhaddps %xmm0, %xmm0, %xmm0
        popq  %rbp
        retq
    

Intuitively it feels like my version should be faster, but it isn't. In this
code I changed the struct from 3 f32 components to an array of 4 f32 elements,
to avoid having to create the array during the computation itself. The code
also requires specific alignment to avoid segfaulting, which I guess might
also have affected performance.

0: [https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-ps](https://github.com/k0nserv/rusttracer/commits/SIMD-mm256-dp-ps)

~~~
slavik81
I'm actually rather bad at this, but my understanding is that the horizontal
operations are relatively slow. The easiest way to get throughput out of SIMD
is to have a structure representing 4 points, with a vec4 of your X values, a
vec4 of your Y values, and a vec4 of your Z values. Then you can do 4 dot
products easily and efficiently, using only a handful of vertical packed
instructions. (If figuring out how to effectively use a structure like that
sounds difficult and annoying, that's because it is.)

~~~
K0nserv
Yeah, that was my conclusion too. I don't think I have any cases where the
need to perform the dot product between multiple pairs of vectors arises,
however, at least not anywhere in the hot loop where it would help. Raytracers
tend to use a lot of dot products followed by some checks, then another dot
product, but it's a strictly sequential process in these algorithms.

~~~
a_e_k
I suspect that most people attempting to vectorize their ray tracer try the
straightforward horizontal approach here and discover that it's not really any
faster. I certainly did when I first started in this area.

The key to the vertical, structure-of-arrays approach is to move to a more
data-flow style approach. Instead of generating a single ray at a time from
your camera code you can generate them in small batches corresponding to a
rectangle on the screen. When I was in grad school we'd call these ray
packets. We'd also do things like organize them into little 2x2 pixel
"packlets" within the packet to take advantage of SSE or AltiVec. (This tends
to work great with coherent pinhole camera rays, and shadow rays towards a
common light source, but not so well with path-traced indirect rays which
quickly become incoherent.)

Likewise, don't go all the way down to single primitives in the leaves of your
acceleration structure. Instead, try to group them into small blocks of the
same primitive type that are clustered together spatially. For example, you
might have triangle mesh intersector code that has a BVH specialized for large
numbers of pure triangles (generic programming can help here, e.g.,
BVH<triangle> for meshes and BVH<sphere> for point clouds). Since the leaves
may have different numbers of primitives, a typical approach would be to pack
all the primitives into a single flat array and then have your BVH leaves just
give you the indices of the range within the array. (The nice thing is that
this typically also shrinks your acceleration structures since you have fewer
nodes due to the leaves not being so fine-grained.)

If you're curious to see some vectorized ray/triangle intersection code that
was fairly state-of-the-art in its day, I have an old paper[0] on the topic.
I'd also suggest having a look at Intel's Embree[1], especially the early
publications. The library is open source (Apache licensed), well organized,
and not too large.

[0]
[https://www.cs.utah.edu/~aek/research/triangle.pdf](https://www.cs.utah.edu/~aek/research/triangle.pdf)

[1] [https://www.embree.org/](https://www.embree.org/)

------
rwmj
Is there a mistake in the original code on the right hand side? I get:

    
    
      card.cpp:16:2: error: ‘g’ was not declared in this scope
         16 | <g;p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
            |  ^
    

Edit: Yes there is. The ‘<g;’ seems like it should have been the single
character ‘<’, perhaps a corrupted HTML escape.

~~~
a_e_k
(Original author of this code[0] here.) Yes, that looks like a corrupted copy.
The line should be:

    
    
        <p)t=p,n=v(0,0,1),m=1;for(i k=19;k--;)
    

[0]
[http://www.cs.utah.edu/~aek/code/card.cpp](http://www.cs.utah.edu/~aek/code/card.cpp)

------
danielscrubs
Well there goes my weekend. :)

I also tried to optimise the code, and got great speed increases with just
constexpr the vector methods and could quickly see that rand was problematic
and then Fabien releases this post with nvcc that are another level. Really
great blog post!

------
ectoplasmaboiii
A Ray-Tracer in 7 Lines of K:
[http://nsl.com/k/ray/ray.k](http://nsl.com/k/ray/ray.k)

~~~
jagged-chisel
7? Just remove those comments and the last few newlines and it’ll be one line.

I don’t know K, but it looks like it uses semicolons to end statements. It’s
“cheating” on line count to compress statements by just removing \n.

After all, how many “lines” are in the business card raytracer? 4.

------
blondin
i love what fabien is doing with his website!

also been experimenting with pure html with an itsy-bitsy amount of css. for
months now i wondered how to display code without involving javascript.

that textarea is so perfect! and i bet you when you copy and paste into word
or your todo list application they won't even try to be "smart" about knowing
what "rich text" is...

that's very cool.

~~~
GuiA
i am glad that there is a small village of indomitable developers holding out
against tens of megabytes of reactive functional progress modern javascript
frameworks.

thank you

~~~
wetmore
There is a place for both.

~~~
GuiA
It’s not so much that there’s a place for both as that both necessarily exist
as two points in the same space.

But the reality is that more websites than not these days will send you many
megabytes of JS, mainly for the purpose of tracking you and extracting
money/time from you, under the guise of “user experience”.

So when I see some of those rare people who still actually care about quality,
speed, performance, accessibility, etc I make sure to appreciate their work.

------
rrss
This was really fun to read, thanks fsanglard.

> This is correlated with the warning nvcc issued. Because the raytracer uses
> recursion, it uses a lot of stacks. So much actually that the SM cannot keep
> more than a few alive.

Stack frame size / "local memory" size doesn't actually directly limit
occupancy. There's a list of the limiters here:
[https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm](https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm).
I'm not sure why the achieved occupancy went up after removing the recursion,
but I'd guess it was something like the compiler was able to reduce register
usage.

------
fegu
Almost into passable frame rate territory. Next version could be business card
VR:)

------
ntry
101,000ms to 150ms is a phenomenal speedup. Props

~~~
correct_horse
To be fair, the thing he started with was totally unreadable and fits on a
business card and the thing he ended up with was readable, but didn't fit. It
isn't apples-to-apples.

~~~
sp332
The original is 1,287 bytes after removing comments and line endings where
possible. After just a bit of search-and-replace to shrink variable names, I
got the new one down to 2,373 bytes including the same error checking and
output at the end. So 1.8 business cards, which is not too bad.

And no I haven't tried to compile it.
[https://pastebin.com/LDRd6U4e](https://pastebin.com/LDRd6U4e)

~~~
a1369209993
Hmm, some further improvements:

    
    
      typedef float F;typedef int I;
      // saves 2 lines (but zero bytes)
      
      #define R return
      // OR
      #define O(S,A,R) operator S(A){return R;}
      
      #define E(F){E_((F||cudaPeekAtLastError)(),__FILE__,__LINE__);}
    

and so on - although you'd probably have to make semantic changes to get onto
a single card. (Or use a smaller font, but that's presumably cheating.)

~~~
sp332
I feel like the self-timing and error-checking is unnecessary, and you can at
least fit it on the front + back of a business card with enough room left for
your name and an email address.

------
mianos
I wonder if the tool-chain would be better under Linux? It is kind of funny
the way Windows development has always been a hassle. Mscvars.bat and such
have been there for at least 20 years.

------
tomsmeding
Nice work on the GPU programming, and the multicore before that, but I'm
mystified why going from -O0 to -O3 is called an "optimisation". All respect
for Fabien, but any code that's supposed to run faster than a snail (assuming
you're not debugging and don't need -O0 for sensible output) implies -O2 or
-O3. (In practice, -O3 often doesn't give much performance over -O2, despite
increasing compile times.)

The initial time is not 101.8 seconds, it's 11.6 seconds.

~~~
slavik81
You could also add -ffast-math, which loosens the rules for floating point
optimizations. For example, it would allow the compiler to turn floating point
divisions into multiplications by an inverse, and to group operations more
efficiently even if doing so would slightly affect rounding. It also rounds
denormal numbers down to zero, which can greatly improve performance on a lot
of hardware.

-march=native may also be useful, as it would allow the compiler to use newer CPU instructions, and tune the generated code to your hardware. That would make the program less portable, but it's not like CUDA is portable either.

My machine matches those numbers surprisingly closely. With -O0 it took 89.6s.
With -O3, it took 11.7s. With -Ofast (which combines -O3 and -ffast-math), it
took 10.6s. With -Ofast -march=native, it took 8.9s. I would expect those
gains to extrapolate to the multi-threaded version, maybe pushing it down to 1
second without any further work. (Note: I'm using GCC on Ubuntu 18.04 with a
Haswell i7. Your mileage may vary.)

~~~
amelius
Offtopic, but I think that languages should have special float types that
trigger the use of fast math. That way, a programmer can better control which
parts of a program are done with approximate floating point operations.

~~~
slavik81
One of the reasons why I think Zig looks appealing is that you can set the
policy for these sorts of things on a per-block basis:
[https://ziglang.org/documentation/master/#setFloatMode](https://ziglang.org/documentation/master/#setFloatMode)

~~~
amelius
Imho that's too fine-grained. Usually you determine which variables don't need
strict accuracy rather than which code, so that's better controlled through
types. Also, it's easy to limit approximations to code blocks by casting
to/from approximate types around a code block so you can still have fine
grained control.

------
lonk
If you increase the resolution you can put more code on a business card :P

~~~
Y_Y
If you make your business cards out of silicon wafer I can get you 10nm
resolution.

