
Programming language ray tracing benchmarks project - jblindsay
https://github.com/niofis/raybench
======
kenhwang
Ordered by realtime, fastest to slowest for those like me who got annoyed by
the scrolling up and down trying to compare:

    
    
      Rust (1.13.0-nightly)         1m32.392s
      Nim (0.14.2)                  1m53.320s
      C                             1m59.116s
      Julia (0.4.6)                 2m01.166s
      Crystal (0.18.7)              2m01.735s
      C Double Precision            2m26.546s
      Java (1.7.0_111)              2m36.949s
      Nim Double Precision (0.14.2) 3m19.547s
      OCaml                         3m59.597s
      Go 1.6                        6m44.151s
      node.js (6.2.1)               7m59.041s
      node.js (5.7.1)               8m49.170s
      C#                           12m18.463s
      PyPy                         14m02.406s
      Lisp                         24m43.216s
      Haskell                      26m34.955s
      Elixir                      123m59.025s
      Elixir MP                   138m48.241s
      Luajit                      225m58.621s
      Python                      348m35.965s
      Lua                         611m38.925s

~~~
weberc2
That's a big jump between OCaml and Go. I'm not familiar with ray tracing, but
skimming the source code it mostly looks like it's doing floating point math;
it doesn't look like it's using the runtime (no allocations, no virtual
function calls, no scheduling, etc), so I'm surprised that Go is performing
relatively poorly.

I wonder if the performance gap is attributable to some overhead in Go's
function calls? I know Go passes parameters on the stack instead of via
registers... Maybe it's due to passing struct copies instead of references
(looks like the C version passes references)? Generally poor code generation?

Anyone else have ideas or care to profile?

EDIT: On my 2015 MBP, Go (version 1.12) is indeed quite a lot slower than C,
but only if the C build is optimized (`-O3`):

    
    
        tmp $  time ./gorb
        real    1m15.128s
        user    1m9.366s
        sys     0m6.754s
    
        tmp $  clang crb.c
        tmp $  time ./a.out
        real    1m13.041s
        user    1m10.284s
        sys     0m0.624s
    
        tmp $  gcc crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
        tmp $  time ./crb
        real    0m22.703s
        user    0m22.550s
        sys     0m0.073s
    
        tmp $  clang crb.c -o crb -std=c11 -O3 -lm -D_XOPEN_SOURCE=600
        tmp $  time ./crb
        real    0m22.689s
        user    0m22.564s
        sys     0m0.060s
    

EDIT2: I re-modified the Go version
([https://gist.github.com/weberc2/2aed4f8d3189d09067d564448367...](https://gist.github.com/weberc2/2aed4f8d3189d09067d56444836772da))
to pass references and that seems to put it on par with C (or I mistranslated,
which is also likely):

    
    
        $ time ./gorb 
        real    0m19.282s
        user    0m14.467s
        sys     0m7.523s

~~~
jerf
There's a variety of possibilities. Lerc mentions GC as one possibility, which
could definitely be the case. Another one that would be high on my "first
guess" list is that everything above it has much better optimizers, and
raytracing code is one of the places this is really going to show. Go does
basically very little optimization, because it prioritizes fast compilation.

(Where Go "wants" to play is that same benchmark, except including compilation
time.)

A couple of the things below Go I suspect are bad implementations. I would
expect a warmed-up C# to beat Go, or at least be at parity, if both have
reasonable (not super-crazy optimized) implementations, and Luajit may also be
a slow implementation. In both cases that's because ray-tracing code is a
great place for a JIT to come out and play. EDIT: Oh, I see C# is Mono, and
not the Windows implementation. In that case that makes sense.

Oh, and I find it helpful to look at these things logarithmically. I think it
matches real-world experiences somewhat better, even though we truly pay for
performance in the linear world. From that perspective, it's still only the
second largest. The largest is Haskell to Elixir, which is substantially
larger. OCaml->Go is large, but not crazily so; several other differences
come close.
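
To put rough numbers on the logarithmic view, here is a small sketch that converts the "real" times from the table at the top of the thread into seconds and ranks the gaps between adjacent entries by log2 of their ratio. The helper names `to_seconds` and `log_gaps` are made up for illustration; the only data used is the table itself.

```python
import math

# Wall-clock ("real") times copied from the table at the top of the thread.
RESULTS = [
    ("Rust", "1m32.392s"),
    ("Nim", "1m53.320s"),
    ("C", "1m59.116s"),
    ("Julia", "2m01.166s"),
    ("Crystal", "2m01.735s"),
    ("C Double Precision", "2m26.546s"),
    ("Java", "2m36.949s"),
    ("Nim Double Precision", "3m19.547s"),
    ("OCaml", "3m59.597s"),
    ("Go", "6m44.151s"),
    ("node.js 6.2.1", "7m59.041s"),
    ("node.js 5.7.1", "8m49.170s"),
    ("C#", "12m18.463s"),
    ("PyPy", "14m02.406s"),
    ("Lisp", "24m43.216s"),
    ("Haskell", "26m34.955s"),
    ("Elixir", "123m59.025s"),
    ("Elixir MP", "138m48.241s"),
    ("Luajit", "225m58.621s"),
    ("Python", "348m35.965s"),
    ("Lua", "611m38.925s"),
]


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '1m32.392s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def log_gaps(results):
    """Return (log2 ratio, faster, slower) per adjacent pair, largest first."""
    gaps = []
    for (fast, t_fast), (slow, t_slow) in zip(results, results[1:]):
        ratio = to_seconds(t_slow) / to_seconds(t_fast)
        gaps.append((math.log2(ratio), fast, slow))
    return sorted(gaps, reverse=True)


if __name__ == "__main__":
    # Print the five largest gaps, measured in "doublings" of run time.
    for gap, fast, slow in log_gaps(RESULTS)[:5]:
        print(f"{fast} -> {slow}: {gap:.2f} doublings")
```

A gap of 1.0 means the slower entry took twice as long as the one above it, so the ranking is independent of whether the times are minutes or hours.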

~~~
alkonaut
There are multiple "levels" of performance in play here, and which level a
language performs on depends on the language, runtime and implementation.

The most naive level is e.g. allocating heap objects for vectors, rays etc. On
that level the algorithm is probably bounded by pointer chasing, cache misses
and GC.

The next level up is an allocation-free loop (at least).

The best level is optimized and allocation-free. If the implementation
isn’t allowed to optimize (use SoA instead of AoS, manually vectorize, unroll
etc) then the winning languages will be the ones that have sophisticated
compilers such as those with LLVM backends.

As an example: The C# example should be on the second level here - but it has
a poor implementation (looks like it’s ported from java or written by a java
developer) so it’s actually stuck on the first naive level.
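
A rough illustration of the AoS/SoA distinction, assuming nothing about the actual raybench sources: `Vec3`, `sum_x_aos` and `sum_x_soa` are hypothetical names. Python will not show the speed difference a compiled language would; the sketch only shows the two layouts, one heap object per vector versus one contiguous typed buffer per component.

```python
from array import array


class Vec3:
    """One heap-allocated object per vector (the 'naive level' layout)."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z


def sum_x_aos(points):
    """AoS: every access follows a reference to a separate object."""
    return sum(p.x for p in points)


def sum_x_soa(xs):
    """SoA: all x components sit next to each other in one buffer."""
    return sum(xs)


if __name__ == "__main__":
    aos = [Vec3(float(i), 0.0, 0.0) for i in range(4)]
    # 'f' stores single-precision floats; one array per component.
    soa_x = array("f", (p.x for p in aos))
    soa_y = array("f", (p.y for p in aos))
    soa_z = array("f", (p.z for p in aos))
    print(sum_x_aos(aos), sum_x_soa(soa_x))
```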

------
Someone
This times performance like this:

    
    
      $ time ./crb
    

That means time spent writing the .ppm file is included.

In the implementations I browsed, that is about a million _print_ calls, each
of which might flush the output buffer, and whose performance may depend on
locale.

To benchmark ray tracing I would, instead, just output the sum of the pixel
values, or set the exit code depending on that value.

Even though ray tracing is cpu intensive, it also wouldn’t completely surprise
me if some of the implementations in less mature languages spent significant
time writing that output because their programmers haven’t come around to
optimizing such code.
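
A minimal sketch of that idea, with a made-up `render` function standing in for the ray tracer (it just produces a gradient): time only the rendering step and reduce the image to a checksum instead of writing a .ppm file. All names here are hypothetical.

```python
import time

WIDTH, HEIGHT = 64, 48  # tiny hypothetical image so the sketch runs quickly


def render(width, height):
    """Stand-in for the ray tracer: returns a flat list of (r, g, b) triples."""
    return [((x * 255) // (width - 1), (y * 255) // (height - 1), 128)
            for y in range(height) for x in range(width)]


def checksum(pixels):
    """Sum every channel so the result still depends on every pixel."""
    return sum(r + g + b for r, g, b in pixels)


def timed_render(width, height):
    """Time only the rendering step, excluding any output."""
    start = time.perf_counter()
    pixels = render(width, height)
    elapsed = time.perf_counter() - start
    return pixels, elapsed


if __name__ == "__main__":
    pixels, elapsed = timed_render(WIDTH, HEIGHT)
    print(f"render: {elapsed:.6f}s, checksum: {checksum(pixels)}")
```

Printing a single number keeps I/O cost negligible while still making it obvious if two implementations disagree on the image.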

~~~
weberc2
The Go version spends 8 seconds writing data. This is probably par for the
course for most implementations.

------
piinbinary
The Haskell version can be made >= 3x faster by making the computations
non-lazy (strict), e.g.

    
    
        -data Vector3 = Vector3 {vx::Float, vy::Float, vz::Float} deriving (Show)
        +data Vector3 = Vector3 {vx :: !Float, vy :: !Float, vz :: !Float} deriving (Show)

~~~
niofis
This seems like a nice improvement, could you send a PR for it?

~~~
piinbinary
[https://github.com/niofis/raybench/pull/8](https://github.com/niofis/raybench/pull/8)

------
kyberias
I don't think the C# time is representative. I suspect Mono is really slow
here. I just ran it with VS 2015 in 1 min 24 sec.

~~~
alkonaut
The C# implementation looks flawed (uses reference types for vectors etc).
Using value types and .NET Core should give a much better result than that.
Will try to remember to do a PR.

~~~
ygra
I've just tried out a bit with .NET Core 2.2.

Baseline of the non-multithreaded variant on my machine: 1m56s

Making Vector3 a struct: 1m3s

Making Vector3 a readonly struct: 1m1s

Making Hit and Ray a struct: 1m26s

Will test more tomorrow, I guess, but the most obvious change already yields a
2× speedup. This was also without any profiling, so I don't even know what I
did there.

------
ilitirit
Interesting that Nim is slightly faster than C, considering that it compiles
down to C.

~~~
Lerc
That's possibly because of the "Code must follow best practices" restriction.

Oftentimes compile to C is "It's C Jim, but not as we know it"

You can write C as if it is a SSA VM or similar intermediate representation
that leaves very little work for the first stages of the compiler.

~~~
niofis
That's correct, as Nim's generated C code is not idiomatic.

------
kev6168
I am surprised PyPy has such a huge lead over Python.

    
    
        $ time python pyrb.py 
    
        real    348m35.965s
        user    345m51.776s
        sys     0m22.880s
    
        $ time pypy pyrb.py
    
        real    14m2.406s
        user    13m55.292s
        sys     0m1.416s

~~~
X6S1x6Okd1st
Raytracing is a pretty great place to apply pypy, you have a very heavy loop
that will hit the JIT.

I've certainly seen speedups like that on stuff like project euler code.

------
fnord77
> rustc 1.13.0-nightly

That's an ancient version of Rust. Interesting that it is faster than C, though.

~~~
Lerc
To be fair, the README.md seems to be three years old.

It would actually be quite interesting to see a comparison with all of the
languages using more recent builds to see which ones are developing their
performance.

------
JoshuaScript
You should see a performance boost in the Haskell implementation by compiling
with GHC's LLVM backend[0]. Another Haskell ray tracer ran 30 % faster than
the native codegen this way[1].

[0][https://gitlab.haskell.org/ghc/ghc/wikis/commentary/compiler...](https://gitlab.haskell.org/ghc/ghc/wikis/commentary/compiler/backends/llvm)

[1][http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-ll...](http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html)

------
azhenley
This is awesome! More good press for Nim.

There is a big variation in performance, some of which I find surprising. Do
you know what exactly causes some languages to be so slow (e.g., small objects
being created and garbage collected frequently)?

~~~
niofis
Did some testing and found that it boils down to three things: RNG algorithm
used in the standard library, forced use of double precision floating point
numbers (the case for OCaml and javascript), and like you mentioned, memory
management.

EDIT: forgot to mention the obvious things: compiler/interpreter maturity and
inherent overhead.

------
technological
I wonder where D would have placed.

------
xiphias2
Looking at the Julia implementation, fast math wasn't used. In my experience
it's usually worth experimenting with turning it on (also of course for the
other LLVM based languages), though I understand that this benchmark tries to
keep the program correct at all costs.

------
stevelosh
I looked over the Common Lisp version at
[https://github.com/niofis/raybench/blob/master/lisprb.lisp](https://github.com/niofis/raybench/blob/master/lisprb.lisp)
and it's… really bad, in a lot of ways.

    
    
        (declaim (optimize (speed 3) (safety 0) (space 0) (debug 0) (compilation-speed 0)))
    

_Never_ use `(optimize (safety 0))` in SBCL — it throws safety _completely_
out the window. We're talking C-levels of safety at that point. Buffer
overruns, the works. It _might_ buy you 10-20% speed, but it's not worth it.
Lisp responsibly, use `(safety 1)`.

    
    
        (defconstant WIDTH 1280)
    

People generally name constants in CL with +plus-muffs+. Naming them as
uppercase doesn't help because the reader uppercases symbol names by default
when it reads. So `(defconstant WIDTH ...)` means you can no longer have a
variable named `width` (in the same package).

    
    
        (defstruct (vec
                     (:conc-name v-)
                     (:constructor v-new (x y z))
                     (:type (vector float)))
          x y z)
    

Using `:type (vector float)` here is trying to make things faster, but
failing. The type designator `float` covers _all_ kinds of floats, e.g. both
`single-float`s and `double-float`s in SBCL. So all SBCL knows is that the
struct contains some kind of float, and it can't really do much with that
information. This means all the vector math functions below have to fall back
to generic arithmetic, which is extremely slow. SBCL even warns you about this
when it's compiling, thanks to the `(optimize (speed 3))` declaration, but I
guess they ignored or didn't understand those warnings.

    
    
        (defconstant ZERO (v-new 0.0 0.0 0.0))
    

This will cause problems because if it's ever evaluated more than once it'll
try to redefine the constant to a new `vec` instance, which will not be `eql`
to the old one. Use `alexandria:define-constant` or just make it a global
variable.

All the vector math functions are slow because they have no useful type
information to work with:

    
    
        (disassemble 'v-add)
        ; disassembly for V-ADD
        ; Size: 160 bytes. Origin: #x52D799AF
        ; 9AF:       488B45F8         MOV RAX, [RBP-8]                ; no-arg-parsing entry point
        ; 9B3:       488B5001         MOV RDX, [RAX+1]
        ; 9B7:       488B45F0         MOV RAX, [RBP-16]
        ; 9BB:       488B7801         MOV RDI, [RAX+1]
        ; 9BF:       FF1425A8001052   CALL QWORD PTR [#x521000A8]     ; GENERIC-+
        ; 9C6:       488955E8         MOV [RBP-24], RDX
        ; 9CA:       488B45F8         MOV RAX, [RBP-8]
        ; 9CE:       488B5009         MOV RDX, [RAX+9]
        ; 9D2:       488B45F0         MOV RAX, [RBP-16]
        ; 9D6:       488B7809         MOV RDI, [RAX+9]
        ; 9DA:       FF1425A8001052   CALL QWORD PTR [#x521000A8]     ; GENERIC-+
        ; 9E1:       488BDA           MOV RBX, RDX
        ; 9E4:       488B45F8         MOV RAX, [RBP-8]
        ; 9E8:       488B5011         MOV RDX, [RAX+17]
        ; 9EC:       488B45F0         MOV RAX, [RBP-16]
        ; 9F0:       488B7811         MOV RDI, [RAX+17]
        ; 9F4:       48895DE0         MOV [RBP-32], RBX
        ; 9F8:       FF1425A8001052   CALL QWORD PTR [#x521000A8]     ; GENERIC-+
        ; 9FF:       488B5DE0         MOV RBX, [RBP-32]
        ; A03:       49896D40         MOV [R13+64], RBP               ; thread.pseudo-atomic-bits
        ; A07:       498B4520         MOV RAX, [R13+32]               ; thread.alloc-region
        ; A0B:       4C8D5830         LEA R11, [RAX+48]
        ; A0F:       4D3B5D28         CMP R11, [R13+40]
        ; A13:       772E             JNBE L2
        ; A15:       4D895D20         MOV [R13+32], R11               ; thread.alloc-region
        ; A19: L0:   C600D9           MOV BYTE PTR [RAX], -39
        ; A1C:       C6400806         MOV BYTE PTR [RAX+8], 6
        ; A20:       0C0F             OR AL, 15
        ; A22:       49316D40         XOR [R13+64], RBP               ; thread.pseudo-atomic-bits
        ; A26:       7402             JEQ L1
        ; A28:       CC09             BREAK 9                         ; pending interrupt trap
        ; A2A: L1:   488B4DE8         MOV RCX, [RBP-24]
        ; A2E:       48894801         MOV [RAX+1], RCX
        ; A32:       48895809         MOV [RAX+9], RBX
        ; A36:       48895011         MOV [RAX+17], RDX
        ; A3A:       488BD0           MOV RDX, RAX
        ; A3D:       488BE5           MOV RSP, RBP
        ; A40:       F8               CLC
        ; A41:       5D               POP RBP
        ; A42:       C3               RET
        ; A43: L2:   6A30             PUSH 48
        ; A45:       FF142520001052   CALL QWORD PTR [#x52100020]     ; ALLOC-TRAMP
        ; A4C:       58               POP RAX
        ; A4D:       EBCA             JMP L0
    

If they had done the type declarations correctly, it would look more like
this:

    
    
        ; disassembly for V-ADD
        ; Size: 122 bytes. Origin: #x52C33A78
        ; 78:       F30F104A05       MOVSS XMM1, [RDX+5]              ; no-arg-parsing entry point
        ; 7D:       F30F105F05       MOVSS XMM3, [RDI+5]
        ; 82:       F30F58D9         ADDSS XMM3, XMM1
        ; 86:       F30F104A0D       MOVSS XMM1, [RDX+13]
        ; 8B:       F30F10670D       MOVSS XMM4, [RDI+13]
        ; 90:       F30F58E1         ADDSS XMM4, XMM1
        ; 94:       F30F104A15       MOVSS XMM1, [RDX+21]
        ; 99:       F30F105715       MOVSS XMM2, [RDI+21]
        ; 9E:       F30F58D1         ADDSS XMM2, XMM1
        ; A2:       49896D40         MOV [R13+64], RBP                ; thread.pseudo-atomic-bits
        ; A6:       498B4520         MOV RAX, [R13+32]                ; thread.alloc-region
        ; AA:       4C8D5820         LEA R11, [RAX+32]
        ; AE:       4D3B5D28         CMP R11, [R13+40]
        ; B2:       7734             JNBE L2
        ; B4:       4D895D20         MOV [R13+32], R11                ; thread.alloc-region
        ; B8: L0:   66C7005903       MOV WORD PTR [RAX], 857
        ; BD:       0C03             OR AL, 3
        ; BF:       49316D40         XOR [R13+64], RBP                ; thread.pseudo-atomic-bits
        ; C3:       7402             JEQ L1
        ; C5:       CC09             BREAK 9                          ; pending interrupt trap
        ; C7: L1:   C7400103024F50   MOV DWORD PTR [RAX+1], #x504F0203  ; #<SB-KERNEL:LAYOUT for VEC {504F0203}>
        ; CE:       F30F115805       MOVSS [RAX+5], XMM3
        ; D3:       F30F11600D       MOVSS [RAX+13], XMM4
        ; D8:       F30F115015       MOVSS [RAX+21], XMM2
        ; DD:       488BD0           MOV RDX, RAX
        ; E0:       488BE5           MOV RSP, RBP
        ; E3:       F8               CLC
        ; E4:       5D               POP RBP
        ; E5:       C3               RET
        ; E6:       CC0F             BREAK 15                         ; Invalid argument count trap
        ; E8: L2:   6A20             PUSH 32
        ; EA:       E8F1C64CFF       CALL #x521001E0                  ; ALLOC-TRAMP
        ; EF:       58               POP RAX
        ; F0:       EBC6             JMP L0
    

The weirdness continues:

    
    
        (defstruct (ray
                     (:conc-name ray-)
                     (:constructor ray-new (origin direction))
                     (:type vector))
          origin direction)
    

The `:conc-name ray-` is useless, that's the default conc-name. And again with
the `:type vector`… just make it a normal struct. I was going to guess that
they were doing it so they could use vector literals to specify the objects,
but then why are they bothering to define a BOA constructor here? And the
slots are untyped, which, if you're looking for speed, is not doing you any
favors.

I took a few minutes over lunch to add some type declarations to the slots and
important functions, inlined the math, cleaned up the broken indentation and
naming issues:

[https://gist.github.com/sjl/005f27274adacd12ea2fc7f0b7200b80...](https://gist.github.com/sjl/005f27274adacd12ea2fc7f0b7200b80/revisions?diff=split#diff-48e2da69300a7d7516647faf76fc0e20)

The old version runs in 5m12s on my laptop, the new version runs in 58s. So if
we unscientifically extrapolate that to their 24m time, it puts it somewhere
around 5m in their list. This matches what I usually see from SBCL: for
numeric-heavy code generic arithmetic is very slow, and some judicious use of
type declarations can get you to within ~5-10x of C. Getting more improvements
beyond that can require really bonkers stuff that often isn't worth it.
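
The extrapolation is just a ratio, spelled out here as a quick sketch. It assumes the laptop speedup would carry over unchanged to the benchmark machine, which is exactly the unscientific part; the variable names are only for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + seconds


old_laptop = to_seconds(5, 12)          # original version on the laptop
new_laptop = to_seconds(0, 58)          # version with type declarations
benchmark_old = to_seconds(24, 43.216)  # Lisp entry in the published table

speedup = old_laptop / new_laptop
# Assumption: the same speedup applies on the benchmark machine.
estimate = benchmark_old / speedup

if __name__ == "__main__":
    minutes, seconds = divmod(estimate, 60)
    print(f"speedup: {speedup:.2f}x, estimate: {int(minutes)}m{seconds:.0f}s")
```

That works out to a bit over 5x faster and an estimate between four and a half and five minutes, which would land between OCaml and Go in the table.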

~~~
armitron
I did some quick changes to your code (inlining, stack allocating) and got a
further ~2x speedup which makes SBCL performance equivalent to Julia.

~~~
stevelosh
Yeah I considered trying some dynamic-extent declarations but just didn't care
all that much. Can you post your version? I'm curious how far into the
declaration weeds you need to go to get that extra 2x.

EDIT: I'm also curious how much using an optimized vector math library (e.g.
sb-cga) would buy you instead of hand-rolling your own vector math. It would
certainly be _easier_.

------
PorterDuff
It would be interesting to take that C version and hammer on it a bit for
speed.

...and then add SIMD.

------
omaranto
The results are more or less in line with what I would have expected, except
for SBCL and Luajit, which I would have expected to be much faster.

~~~
armitron
The Lisp code is terrible performance-wise and written by someone who
obviously doesn't know Common Lisp very well.

Look at Steve Losh's comment here for something a lot better. My own (further)
improvements put SBCL performance in the same order as Julia.

------
iainmerrick
The most impressive result here is Lua -- not far behind C! LuaJIT is amazing.

Good to see a few languages like Nim and Rust actually beating C for raw
performance, too.

~~~
wlesieutre
Am I reading it wrong or did LuaJIT take 113x as long as C?

~~~
ddlutz
It sure looks that slow to me.

~~~
iainmerrick
Ha, you’re right, I totally misread it! I read it as seconds and not minutes,
doh. :)

