
Optimizing Ray Tracing in Haskell - jose_zap
https://medium.com/swlh/optimizing-ray-tracing-in-haskell-3dc412fff20a
======
siraben
Optimizing Haskell for performance can be tricky. When I wrote a functional
image library (updated at [0]) I tried to get the frame draw time down by
using strictness, unpacked records, and so on, but one of the changes that
made the biggest difference came from changing round to truncate[1]: I went
from 120 ms/frame to 65 ms/frame! The reason is that round has quite a
bit of logic associated with it[2] in Haskell's standard library, versus a
simple one-liner for truncate.

It goes to show that even in simple functions such as these, underlying
implementation details can have an impact on performance.
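
The difference can be sketched in a few lines (toByte is a hypothetical helper, not from the library): GHC's round for Double has to implement round-half-to-even, which takes a comparison-heavy code path, while truncate just drops the fractional part and can compile down to a single conversion.

```haskell
-- Hypothetical pixel-conversion helpers, for illustration only.
-- truncate simply discards the fractional part:
toByte :: Double -> Int
toByte x = truncate (x * 255.9)

-- round must do half-to-even rounding, so it carries extra logic:
toByteRound :: Double -> Int
toByteRound x = round (x * 255)

main :: IO ()
main = print (map toByte [0.0, 0.5, 1.0])  -- prints [0,127,255]
```

Note the half-to-even behavior: `toByteRound 0.5` is `round 127.5`, which goes to the even neighbor 128.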

Also, for some reason filling a C array in Haskell was faster than doing it in
C, probably due to the overhead of switching between the Haskell RTS and C.

[0] [https://github.com/siraben/functional-images](https://github.com/siraben/functional-images)

[1] [https://github.com/sbond75/OpenGLTesting/commit/116384edd6a3...](https://github.com/sbond75/OpenGLTesting/commit/116384edd6a36d7cb828310fcc8bfbf6327cc47d)

[2] [https://hackage.haskell.org/package/base-4.14.0.0/docs/src/G...](https://hackage.haskell.org/package/base-4.14.0.0/docs/src/GHC.Real.html#truncate)

~~~
ooobit2
I just watched a video on branchless programming, and it blew my mind that the
guy got a 3.36s function to return in .48s by using _more_ code. The idea of
trying to outsmart a compiler or runtime is motivating me to pick back up
coding after 8 years away.

~~~
nickpeterson
I think it’s a sign our chipsets and tooling have become too complicated.
We’re at this weird point where both sides are trying to guess what the other
one is doing.

~~~
mhh__
The Mill (which seems to be chugging along, if anyone is looking for an
update) is an interesting take on the problem.

If you're not familiar, it's an in-development exposed-pipeline (statically
scheduled) VLIW ISA - not unheard of in DSP land, but genuinely alien compared
to a scalar CISC processor a la x86 or ARM. I'm not particularly optimistic
about their chances of making it into the shops, but they haven't failed yet.

~~~
zozbot234
> seems to be chugging along if any one is looking for an update

Cite? I've been looking for _recent_ updates and haven't found any.

~~~
mhh__
[https://millcomputing.com/topic/news/#post-3534](https://millcomputing.com/topic/news/#post-3534)

Not exactly busy but there you are

------
jose_zap
I find this article a nice counterpoint to “it is difficult to reason about
performance in Haskell”. A person who had never programmed in the language
found all the tools and resources to implement an algorithm from a reference
and make it fast.

~~~
pjmlp
Indeed. The main problem is that many of these beliefs are based on cargo
cult or urban myth. Most people who take the effort to actually dive into new
areas with an open mind, instead of looking for reasons to dismiss them, come
out with new perspectives on software development.

------
amelius
I certainly didn't expect it to be faster than the Rust implementation. Quite
a surprise there.

~~~
rainbow-road
To be fair, the first sentence on the Rust implementation's GitHub repo is

> A _straightforward_ Rust implementation

(emphasis mine) and the implementation in the article is heavily optimized.

~~~
Athas
I don't think this Haskell implementation is _heavily_ optimised per se. The
only places where readability is sacrificed for performance are the use of
INLINE and the custom tail-recursive function, which is not where most of the
performance improvement comes from. The other improvements (an efficient RNG,
an efficient vector representation, and extra strictness) are something you
get by default in Rust.

I would say that if an experienced Haskell programmer set out to write this
program with the intent of making it fast, they'd never use lists in the first
place, and not consider their absence an optimisation. Haskell lists are
designed for lazily-generated arbitrary-length sequences, and are a bad fit
for representing small finite-length sequences, such as vectors.
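
A minimal sketch of what that looks like in practice (the Vec3 type and operations are illustrative, not taken from the article): a strict record with unpacked fields stores three unboxed Doubles inline, with none of the cons cells or thunks a list representation drags along.

```haskell
-- Illustrative vector type: strict, unpacked fields compile to three
-- unboxed Doubles stored flat in the constructor, much like a Rust struct.
data Vec3 = Vec3 {-# UNPACK #-} !Double
                 {-# UNPACK #-} !Double
                 {-# UNPACK #-} !Double
  deriving Show

add :: Vec3 -> Vec3 -> Vec3
add (Vec3 a b c) (Vec3 x y z) = Vec3 (a + x) (b + y) (c + z)

dot :: Vec3 -> Vec3 -> Double
dot (Vec3 a b c) (Vec3 x y z) = a * x + b * y + c * z

main :: IO ()
main = print (dot (Vec3 1 2 3) (Vec3 4 5 6))  -- prints 32.0
```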

------
centimeter
GHC has to walk a fine line between A) applying enough optimization passes to
reliably remove abstraction overhead and B) compiling fast enough that people
won't complain about it.

It's pretty awesome what GHC can pull off in terms of reducing complex
programs to a minimal normal form (you can see this with e.g.
[https://hackage.haskell.org/package/ghc-proofs-0.1.1/docs/GH...](https://hackage.haskell.org/package/ghc-proofs-0.1.1/docs/GHC-Proof.html)),
but oftentimes you do have to convince the optimizer to be more aggressive
than it wants to be.
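
A hedged sketch of what "convincing the optimizer" usually means in practice (the functions here are made up for illustration): pragmas that force inlining of hot code, or that generate monomorphic copies of polymorphic functions so the hot path avoids typeclass dictionary passing.

```haskell
-- Illustrative only: two common pragmas for nudging GHC.

-- Force inlining of a small hot function so it fuses with its callers.
{-# INLINE lerp #-}
lerp :: Double -> Double -> Double -> Double
lerp t a b = a + t * (b - a)

-- Emit a monomorphic Double copy of a polymorphic function, avoiding
-- dictionary passing when it's called at that type.
{-# SPECIALIZE average :: [Double] -> Double #-}
average :: Fractional a => [a] -> a
average xs = sum xs / fromIntegral (length xs)

main :: IO ()
main = print (lerp 0.5 0 10, average [1, 2, 3 :: Double])
```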

~~~
sukilot
B is "compile fast enough that people will tolerate it".

"Fast enough to not be worth complaining about" is nowhere near anyone's TODO
list.

------
platz
I'm not sure it was his first Haskell program; if it was, it's pretty
impressive that he used INLINE, understood minute differences between random
libraries, profiling, ThreadScope, etc.

~~~
snawaz
It was indeed my first Haskell program, so I'd take your comment as a
compliment. I have been a long-time C++ programmer
([https://stackoverflow.com/users/415784/nawaz](https://stackoverflow.com/users/415784/nawaz)),
so I know what kind of tools one should look for in another language when it
comes to making code performant. That helped me.

~~~
hopia
Pretty amazing how you were able to use the appropriate tooling to great
effect right off the bat!

------
muglug
I love how the author provides each optimising commit diff – that really
helped me understand the improvements.

------
ncmncm
The hardest thing about optimizing is knowing whether something is fast. It
helps if you have another of the same thing that is already fast, or seems
so. When you match it, are you now fast?

You are until somebody makes one faster. Then you're slow again. That is our
deal with the Devil: our machines are filled with dodgy gimcracks that often
make our programs faster, but we can hardly ever know whether a program is as
fast as it could be, or whether some gimcrack that would make it faster hasn't
engaged yet; or, often, even how to go about engaging it.

Usually we stop optimizing when it seems fast enough, usually because
something else is (still) slower.

It turns out to be pretty simple to make quicksort go twice as fast on current
hardware. Standard libraries and compilers generally don't do those things,
often out of fear of making a few programs slower than before. People complain
a lot more about a slower program than they thank you for making a hundred
others faster--if they even notice the latter.

------
CyberDildonics
It's interesting that halfway through the optimizations Haskell is doing 3.3
terabytes of total heap allocation.

When it gets to 93 seconds it is doing 208 GB of heap allocation, and in the
final version it is still doing over 20 GB of heap allocation. This seems
like quite a bit for a 384x216 image.

~~~
snawaz
That is puzzling to me as well, particularly because `top` and `htop` on my
machine don't even touch `50 MB`. That means the first figure must be
something else, and I can't find a good explanation for it. I came across
people who suggested ignoring it, e.g.
[https://stackoverflow.com/questions/61666819/haskell-how-to-...](https://stackoverflow.com/questions/61666819/haskell-how-to-detect-lazy-memory-leaks)

If anyone knows the exact reason, please enlighten me as well.
Even this otherwise awesome tutorial ignores that value:
[https://youtu.be/R47959rD2yw?t=1306](https://youtu.be/R47959rD2yw?t=1306)

~~~
sukilot
The runtime allocates and frees heap memory while performing local
calculations.

~~~
snawaz
> The runtime allocates and frees heap memory while performing local
> calculations.

1) If it allocates, why does `top` not show it?

2) Why does it allocate SO MUCH when it uses only a fraction? It seems like a
very bad strategy to allocate 2800 times more memory than is actually used.

~~~
CyberDildonics
You are looking at the current amount of memory that the process is using in
top, but your profiling is probably showing the total allocated while the
process ran. This means there are probably LOTS of memory allocations that get
freed immediately instead of the program allocating memory once and reusing
that heap memory. In the case of an image of this size you could probably just
put everything on the stack.
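
The distinction is visible in GHC's runtime statistics; a minimal sketch (not the article's program) that you can run with `+RTS -s` to compare "bytes allocated" (the cumulative total over the whole run, which is what the profiler reports) against "maximum residency" (roughly what top/htop shows):

```haskell
-- Compile with: ghc -O2 Alloc.hs
-- Run with:     ./Alloc +RTS -s
-- The list below is produced and consumed lazily, so "bytes allocated"
-- counts every transient cons cell, while "maximum residency" stays tiny
-- because each cell is garbage-collected almost immediately.
main :: IO ()
main = print (sum [1 :: Integer .. 1000000])  -- prints 500000500000
```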

When optimizing, excess heap allocations are the first thing to look at, since
they are usually inside inner loops, which almost always means they are not
necessary. It definitely should not be ignored by someone looking for speed.

~~~
snawaz
> In the case of an image of this size you could probably just put everything
> on the stack.

That sounds interesting and would probably make it more efficient. But how do
I put things on the stack? As per this answer on Stack Overflow, the runtime
allocates memory on the heap to call functions (and programmers probably
don't have any control over that?):

[https://stackoverflow.com/a/5132862/415784](https://stackoverflow.com/a/5132862/415784)

~~~
CyberDildonics
I don't know how to avoid heap memory allocations in Haskell or make sure
memory is on the stack, but both are very easy to do in C++.

~~~
snawaz
> I don't know how to avoid heap memory allocations in Haskell or make sure
> memory is on the stack, but both are very easy to do in C++.

Yup. I know it's easy and the default behavior in C++.

This Haskell wiki page says it is pretty common to allocate memory and
deallocate it immediately, and that GHC handles this pretty efficiently,
which probably explains why Haskellers don't seem to care about this (though
it seems weird to me).

[https://wiki.haskell.org/GHC/Memory_Management](https://wiki.haskell.org/GHC/Memory_Management)

------
logicchains
I wonder if the author would have achieved similar results just by enabling
the -XStrict language extension from the start.

~~~
snawaz
I didn't know that. I'll try that. Thanks for mentioning it.

~~~
hopia
-XStrict is almost never optimal in Haskell, as a lot of the optimizations ultimately rely on lazy evaluation. I think it would be more appropriate to try out -XStrictData instead.
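
A quick sketch of the difference (the Ray type is illustrative, not from the article): -XStrictData makes every constructor field strict, as if each had a bang, while leaving let bindings and function arguments lazy, so laziness-dependent optimizations elsewhere are unaffected.

```haskell
{-# LANGUAGE StrictData #-}

-- With StrictData, both fields below are implicitly strict
-- (equivalent to writing !(Double, Double, Double) by hand),
-- so constructing a Ray never builds thunks for its fields.
data Ray = Ray
  { origin    :: (Double, Double, Double)
  , direction :: (Double, Double, Double)
  } deriving Show

main :: IO ()
main = print (origin (Ray (0, 0, 0) (0, 0, 1)))  -- prints (0.0,0.0,0.0)
```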

------
Asooka
That's really cool. The next step would be using explicit SIMD instructions.
Is there a SIMD library for Haskell? Or a library for automatically doing
AoS->SoA transformation on some compute code?

