
An example of fast numerical computation using Haskell - lelf
http://www.mit.edu/~mtikekar/posts/stream-fusion.html
======
tsahyt
I've already posted this on /r/haskell, but since the article turned up here
as well, I might as well post it again. Since we're talking optimization, I
had a quick look at the C code, found it to be a very naive implementation,
and managed to cut the execution time quite a bit by employing a sort of
memoization: once you reach a number of the sequence which you have already
seen before, the length from there to 1 remains the same as before.
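The idea can be sketched in Python (a rough sketch of the approach, not the actual C code from the paste): cache the chain length of every value already visited, and stop walking as soon as the chain hits a cached value.

```python
def max_collatz(limit):
    # cache[n] = steps from n down to 1, using the (3n+1)/2 shortcut
    cache = {1: 0}

    def length(n):
        path = []
        # walk forward until we hit a value whose length is already known
        while n not in cache:
            path.append(n)
            n = n // 2 if n % 2 == 0 else (3 * n + 1) // 2
        steps = cache[n]
        # fill in the cache for every value on the path we just walked
        for m in reversed(path):
            steps += 1
            cache[m] = steps
        return steps

    return max((length(a0), a0) for a0 in range(1, limit))
```

Since the loop runs a0 upward from 1, most chains hit a cached value within a few steps, which is where the speedup comes from.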

So here is the code: [http://lpaste.net/96397](http://lpaste.net/96397)

Here are my measurements:

    
    
        time ./original 1000000
        (329, 837799)
    
        real   0m0.183s
        user   0m0.180s
        sys    0m0.001s
    
        time ./new 1000000
        (329, 837799)
        
        real   0m0.017s
        user   0m0.015s
        sys    0m0.001s

~~~
deletes
The idea is to optimize while keeping the algorithm the same across the
languages, at least if you want any meaningful comparison. You are comparing
languages, not different algorithms. +1 for effort though.

~~~
anon1385
It doesn't seem that meaningful to compare a totally naive C implementation to
an optimised Haskell one either. (Well, other than to show how fast C is even
without any effort to optimise the code.) Saying you have to keep the
algorithm the same seems like a rather arbitrary limitation on the
comparison, especially since some algorithms are much less of a natural fit
in some languages (see: quicksort in Haskell vs C [1]).

I would suggest it's best to show (at least) four versions of the code to
compare two languages: the naive implementation in both, and the really
optimised version in both (plus intermediate levels of optimisation if you
feel dedicated). That gives a picture of how fast the typical code one would
write will be, as well as the potential for optimisation if you hit
bottlenecks, and how much effort it takes to optimise the code.

[1] [http://augustss.blogspot.co.uk/2007/08/quicksort-in-haskell-quicksort-is.html](http://augustss.blogspot.co.uk/2007/08/quicksort-in-haskell-quicksort-is.html)

~~~
lelf
> _totally naive C implementation to an optimised Haskell one either._

The Haskell version is totally naïve. That's the point.

~~~
anon1385
The first version, which takes 5.9 seconds, is naive. The second one certainly
isn't, and the third is based on cleaning up the second with the use of some
library code. As somebody who isn't a Haskell programmer, it isn't clear to me
at all that anybody would write the final version of the code if they weren't
trying to optimise it. Maybe they would, but the article doesn't really argue
that case, and it appears that stream-fusion isn't part of the
language/standard libraries (you have to install it with cabal).

Once you start modifying code to make it run faster, it's no longer a 'naive'
implementation, no matter how nice or logical you think those transformations
are.

I'm not saying you can't compare the stream-fusion version against other
languages, just that it's highly disingenuous to optimise the Haskell code
with some external libraries and then argue that optimisations of the C
version are unfair.

~~~
mietek
If anyone is being disingenuous here, it's you.

The first version [1] is simple and slow. The second version [2] is complex
and fast.

The third, fast version [3] is identical to the _first_, simple version, with
list processing functions imported from the `stream-fusion` package. This is
the entire point of this article.

The only transformation shown is purely mechanical (prepending symbol names
with the qualified import prefix `S.`), and even this is not required (as
`Prelude` could have been imported `hiding` the appropriate symbol names
instead).

[1]:
[http://www.mit.edu/~mtikekar/posts/src/collatz.hs](http://www.mit.edu/~mtikekar/posts/src/collatz.hs)

[2]:
[http://www.mit.edu/~mtikekar/posts/src/collatz1.hs](http://www.mit.edu/~mtikekar/posts/src/collatz1.hs)

[3]:
[http://www.mit.edu/~mtikekar/posts/src/collatz2.hs](http://www.mit.edu/~mtikekar/posts/src/collatz2.hs)

~~~
vinkelhake
Are there downsides to using the stream-fusion package? If not, why hasn't it
been made the default so that the first naive version is fast?

~~~
mietek
As I understand it, stream fusion is still a subject of active research. A
paper showing generalised stream fusion in Haskell was submitted to this
year's ICFP.

There is also no benevolent dictator in charge of Haskell.

[http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf](http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf)

[http://www.reddit.com/r/haskell/comments/1br0ls/haskell_beat...](http://www.reddit.com/r/haskell/comments/1br0ls/haskell_beats_c_using_generalised_stream_fusion/)

~~~
carterschonwald
Indeed. The generalized stream fusion work is about lifting pointwise
operations to their SIMD-vectorized analogues.

Likewise, stream fusion isn't always a win! Stream fusion and its siblings are
great for pointwise operations, but aren't always a good idea for computations
with heavy reuse, like nested convolutions or performant matrix multiply.

------
mjn
The main takeaway here is that Haskell's stream fusion library is very nice.
You can write code in the "naive" modular style that generates intermediate
lists, and they get automagically fused away.

There's a little work on that in Lispland as well. Some discussion, a
prototype, and a link to the older SERIES package:
[http://pvk.ca/Blog/Lisp/Pipes/](http://pvk.ca/Blog/Lisp/Pipes/)

~~~
DanWaterworth
Yes, this is an example where purity is really helpful.

------
lindig
Here is an implementation in OCaml. Time taken on a 1.3 GHz MacBook Air:

    
    
        :tmp $ time ./collatz.native 1000000
        start: 837799 length 329
    
        real	0m0.782s
        user	0m0.776s
        sys	0m0.004s
    
        exception Error of string
    
        let (@@) f x  = f x
        let error fmt = Printf.kprintf (fun msg -> raise (Error msg)) fmt
    
        let max (a, x) (b, y) = if x > y then (a, x) else (b, y)
    
        let length start =
            let rec loop len a = 
                let next = if a mod 2 = 0 then a/2 else (3*a+1)/2 in 
                if a != 1 then
                    loop (len+1) next
                else
                    len
            in
                loop 0 start
                    
        let collatz limit =
            let rec loop start longest =
                if start >= limit 
                then longest 
                else loop (start+1) (max longest (start, length start)) 
            in
                loop 1 (1,0)
    
        let report (start, len) = Printf.printf "start: %d length %d\n" start len
    
        let main () =
            let argv        = List.tl @@ Array.to_list Sys.argv in
            match argv with
            | []    -> report @@ collatz 10000
            | [n]   -> report @@ collatz @@ int_of_string n
            | _     -> error "usage: collatz [n]"
    
        let () = main ()

~~~
jds375
As a big OCaml fan, this is pretty cool. The speeds aren't that bad either. I
hope OCaml gets more competitive in terms of running time in the future.

~~~
lindig
The code could probably be made faster by avoiding passing the (start,
length) tuples around, as these are allocated on the heap. However, I find
this version a little more abstract.

------
nnq

        Cython	0.47s
    

...ignoring the cdefs, it's just as readable or more, at least for someone
used to reading imperative code as opposed to functional code or mathematical
equations. Plus, it takes you five minutes to get from a quick and dirty
Python hackoff to Cython. And you can stay blissfully ignorant of what
"stream fusion" is, focusing on the problem you're trying to solve and not on
the computer science of the tools. Haskell really is far, far away from the
"don't make me think" philosophy (or "don't make me think of anything else
besides the real problem that I'm trying to solve"). No wonder Python and
Cython are so loved in the scientific computing field :)

~~~
AndreasFrom
As you acknowledge yourself, whether Haskell makes you think or not depends on
what you're trained to read.

~~~
nnq
Yes, you're right that Haskell is probably easier to read for someone with a
mathematical background. But I'm referring to the cognitive overhead incurred
when you try to optimize it: thinking about list creation costs, then about
stream fusion to avoid them. Most physicists or mathematicians will cringe at
the thought of having to think about this "computer-sciency optimization
stuff". As opposed to this, a crude tool such as Cython or C or even Fortran
allows one to just skip thinking about these things and still have fast
enough code.

Haskell is great if you want to expand your CS knowledge, but not if your
math/physics/engineering problem is already so complicated that it uses up all
your available brain cycles and working memory and you just can't squeeze in
thinking about language-specific optimizations too. I guess this is why it's
only used by companies like Galois that are already neck deep in advanced CS
problems and used to thinking about them all the time.

~~~
PaulAJ
There is a saying: first make it work, then make it fast. The problem with
doing the second in imperative languages is that you need to pull all your
working code apart and rewrite it. Haskell separates the "make it work" from
the "make it fast".

------
antimora
I have reimplemented the function in Julia and it took 16.51 seconds to run,
compared to 0.19 seconds for the C version.

Here is the code, perhaps someone else would know how to optimize it.

CODE:

    
    
        function main(max_a0)
           longest = max_len = len = a = 0
           for a0 in 1:max_a0
              a = a0      
              len = 0
              while a != 1
                 len += 1
                 a = ((a%2==0) ? a : 3*a+1)/2
              end
              if len > max_len
                 max_len = len
                 longest = a0
              end
           end
           println("($max_len, $longest)")
        end
    
        @time main(1000000)
    
    

CMD:

    
    
        $ julia collatz.jl 
        (329, 837799)
        elapsed time: 16.436623922 seconds (4206873264 bytes allocated)
    
        $ julia --version
        julia version 0.3.0-prerelease+143

~~~
antimora
WOW, I was able to optimize it down to 0.88 seconds by specifying the type of
a variable, like this: a::Int64. That's a factor of 18.6!

Here is the new code:

    
    
        function main(max_a0)
           longest = max_len = len = a = 0
           for a0 in 1:max_a0
              a = a0      
              len = 0
              while a != 1
                 len += 1
                 a::Int64 = ((a%2==0) ? a : 3*a+1)/2
              end
              if len > max_len
                 max_len = len
                 longest = a0
              end
           end
           println("($max_len, $longest)")
        end
    
        @time main(1000000)
    

CMD:

    
    
        $ julia collatz.jl 
        (329, 837799)
        elapsed time: 0.882312881 seconds (1377276 bytes allocated)

~~~
mhath
You can avoid allocating so much memory by changing

    
    
      a::Int64 = ((a%2==0) ? a : 3*a+1)/2
    

to

    
    
      a = ((a%2 == 0) ? a : 3a+1) >> 1
    

or

    
    
      a = div((a%2 == 0) ? a : 3a+1, 2)
    

for me this gives

    
    
      (329, 837799)
      elapsed time: 0.449473338 seconds (544 bytes allocated)
    

For reference the C version gives

    
    
      ./a.out 1000000  0.40s user 0.00s system 99% cpu 0.407 total

~~~
toolslive
what about: a = ((a & 0x01)? (a*3)+1 : a)>>1;

Anyway, when you think about it, you can skip every starting value between 0
and max_a0/2, because for any value there, there is one with a longer chain
(its double). That's a factor of 2 in all languages.
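The doubling argument is easy to sanity-check in Python (a quick sketch, using the same (3n+1)/2 shortcut as the article's code):

```python
def collatz_len(n):
    # chain length from n down to 1, using the (3n+1)/2 shortcut
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else (3 * n + 1) // 2
        steps += 1
    return steps

# 2n halves to n on its very first step, so its chain is exactly one step
# longer; hence no starting value below limit/2 can have the longest chain
for n in range(1, 500):
    assert collatz_len(2 * n) == collatz_len(n) + 1
```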

~~~
mhath
Of course halving the range checked would also be an improvement. My own
improvement was meant only to reduce the memory usage of antimora's code.

------
brihat
A naive implementation in Nimrod:

    
    
         proc main(max_a0: int): int =
           var a, longest, len, max_len : int
         
           for a0 in countup(1, max_a0):
             a = a0
             len = 0
         
             while a != 1:
               len += 1
               if (a mod 2 != 0): a = (3*a + 1)
               a = a div 2
         
             if len > max_len:
               max_len = len
               longest = a0
         
           return longest
         
         # Main program starts here
         echo(main(1000000))
    

Takes 0.76s (the C program in TFA takes 0.58s)

------
lelf
Here we go, Data.Vector and the same somewhat flawed comparison as in the
post:

    
    
      next a = Just . join (,) $ (if even a then a else 3*a+1) `div` 2
    
      len = (+1) . V.length . V.takeWhile (/=1) . V.unfoldr next
    
      result = V.maximum . V.map (\x -> (len x, x)) . V.enumFromTo 1
    
      main = do [a0] <- getArgs
                let max_a0 = read a0 :: Word32
                print . result $ max_a0
    

Results: Haskell 5.0s, C 4.8s. Good enough, I guess?

PS: and this is how you do proper benchmarking:

    
    
      import Criterion.Main
      main = defaultMain [
              bench "data.vector/1M" $ whnf result 10000000
             ]
    

with result like
[http://lelf.lu/tmp/criterion.html](http://lelf.lu/tmp/criterion.html)

~~~
lelf

      resultQ :: W -> (W,W)
      resultQ n = foldAllS max (0,0) $ runIdentity $ R.computeUnboxedP vec
        where
          fun (Z :. i) = len &&& id $ fromIntegral i
          vec = R.fromFunction (Z :. fromIntegral n) fun :: Array D DIM1 (W,W)
    

Not as straightforward, but [http://lelf.lu/files/fast-num-haskell-bench-N2.html](http://lelf.lu/files/fast-num-haskell-bench-N2.html)
(now you know the number of cores on my MacBook, I guess).

Not bad for 3 lines of code?

------
killercup
I was bored and curious to see how the Collatz conjecture works, so I
implemented it in Lua and JS and compared their performance to C and Haskell.
It's not meant to be a (good) benchmark. Here you go:
[https://gist.github.com/killercup/7718804](https://gist.github.com/killercup/7718804)

tl;dr

    
    
        Implementation | Time 
        -------------- | -----
        clang 3.3      | 0.40s
        luajit 2.0.2   | 0.86s
        GHC 7.6.3      | 0.49s
        Node 0.10.22   | 2.43s

~~~
xdenser
My 2 cents to defend Node.js:

    
    
       329 837799
       naive 3.06 sec
       329 837799
       decompozed 3.03 sec
       329 837799
       memoized decompozed 0.52 sec
    

The program
[https://gist.github.com/xdenser/7722083](https://gist.github.com/xdenser/7722083)

~~~
dbaupp
Presumably the Lua, C, and Haskell programs would also be 6× faster if you
used the memoised algorithm, so that's a pretty meaningless comparison.

~~~
xdenser
OK, this intArithm variant (w/o memoization) for Node.js is 6 times faster
(0.54 sec) compared to the naive one:

    
    
            function intArithm(){
            var
                max_a0 = +process.argv[2],
                longest = 0,
                max_len = 0, a0,len, a,z;
    
            for(a0=1;a0<=max_a0;++a0){
                a = a0;
                len = 0;
                while( a != 1){
                    len++;
                    z = a >>> 1;
                    a = (!(a&1))?z:(a+z+1);
                }
                if(len>max_len){
                    max_len = len;
                    longest = a0;
                }
            }
            console.log(max_len,longest);
            }
    

The problem was with the division-by-2 operation, which is not integer
division in JS, so replacing it with the shift >>> does the trick. The formula
optimization gives an additional 200 ms.
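The rewritten formula works because, for odd a, (3a+1)/2 equals a + ⌊a/2⌋ + 1, which needs only an add and a shift. A quick check of the identity (sketched in Python rather than JS):

```python
# for odd a: 3a + 1 = 2a + (a + 1), and (a + 1) // 2 == (a >> 1) + 1,
# so (3a + 1) // 2 == a + (a >> 1) + 1 -- no division needed at all
for a in range(1, 10001, 2):
    assert (3 * a + 1) // 2 == a + (a >> 1) + 1
```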

And I doubt a memoized C or Lua variant would be 6 times faster than a
non-memoized one, as the speedup is not that linear.

------
wereHamster
Why are the stream fusion functions in a separate module? Why were the
functions in prelude not replaced by these more efficient ones?

~~~
PhineasRex
Partly for historical reasons and partly because Prelude is really intended
for beginners.

~~~
tel
Also for definitional reasons!

~~~
DanWaterworth
Could you elaborate?

~~~
cantankerous
He means that there is a generally understood semantic meaning for the
functions as they exist in the Haskell Prelude. Replacing them under the hood
with something potentially different makes the "meaning" of the basic
functions not quite the same. Haskell has a history rooted in this kind of
concern, and it continues in the community now.

~~~
wereHamster
But they are not different. The functions in stream-fusion are a drop-in
replacement for the ones in Data.List. To quote the package description: "To
use, simply import Data.List.Stream in place of Data.List, and hide list
functions from the Prelude."

So to put it simply: you have two functions. They are semantically equivalent,
but one is faster than the other. Why not use the faster one?
------
ghshephard
The python equivalent:

    
    
      real	0m21.891s
    

On machine in which the C Variant:

    
    
      real	0m0.570s
    

But I learned about Cython today, so great article.

It's interesting that the Python version takes so much longer.

    
    
      import sys
    
      check=int(sys.argv[1])
      lnum=1
      lcount=1
    
      for x in range(1,check):
        len=1
        a=x
        while (a>1):
          len+=1
          if a%2==0:
            a=a/2
          else:
            a=(a*3+1)/2
        if len > lcount:
          lcount=len
          lnum=x
    
      print lcount,lnum

------
IsTom
I don't see how the C and Cython versions are different. It's almost
word-for-word C.

------
glogla
I made a quick Clojure version [1]. It's really slow, and probably terrible,
but I wonder if someone more knowledgeable in Clojure could look at it and
optimize it the way it happened with the Haskell version.

[1]: [http://pastebin.com/PH6DYbmr](http://pastebin.com/PH6DYbmr)

~~~
cbp
Since you know the division in collatz-next is always exact, you can change /
to quot or a bit shift right. That should improve it somewhat. You should
also type-hint the arguments to your functions with ^long. Further than that,
I would probably compile the code and then use a decompiler to check for
boxing. But I can't say I've done much math-intensive programming in Clojure.

~~~
wging
Nice.

4.5 -> 2.7 seconds on my machine when using 'quot' instead of '/'.

2.4 seconds when hinting ^long in argument of collatz-length.

1.6 when also hinting ^long in collatz-next. Interestingly, when I hint only
in collatz-next it takes 1.5.

------
bcrack
It is interesting to point out that using -O3 optimization drops the C (and
Cython) code exec time by roughly 45%.

------
iskander
Timings on my machine (2.67ghz Xeon W3520):

    
    
        $time ./c_compiled_with_O2 1000000
        real	0m0.362s
        user	0m0.360s
    
        $time ./c_compiled_with_O3 1000000
        real	0m0.189s
        user	0m0.184s
    

I also tried dropping all the cdefs from the Cython code and running the
Python version in Parakeet
([http://www.parakeetpython.com](http://www.parakeetpython.com)) and it took
0.57s (whereas CPython without the @jit decorator took 19.57s).

------
rakoo
Interestingly, this naive Go version is faster than the naive C version:

    
    
      package main
    
      import (
        "log"
      )
    
      func main() {
        max := 1000000
    
        var maxa int
        var maxlength int
        for a0 := 0; a0 < max; a0++ {
    
          var length int
          a := a0
    
          for a > 1 {
            if a%2 == 0 {
              a = a/2
            } else {
              a = (3 * a + 1) / 2
            }
            length++
          }
    
          if length > maxlength {
            maxa = a0
            maxlength = length
          }
        }
        log.Println(maxlength, maxa)
    
      }
    

Compiled with go 1.1:

    
    
      > go version
      go version go1.1.2 linux/amd64
    

Result:

    
    
      > time ./collatz
      2013/12/01 13:51:31 329 837799
      ./collatz  0,34s user 0,00s system 99% cpu 0,344 total

~~~
rakoo
Oh, and I just tested this on play.golang.org, and I got wrong results because
_a_ was too small, so we need to manually make it a uint64:

    
    
      package main
    
      import (
        "log"
      )
    
      func main() {
        max := 1000000
    
        var maxa int
        var maxlength int
        for a0 := 0; a0 < max; a0++ {
    
          var length int
          a := uint64(a0)
    
          for a > 1 {
            if a%2 == 0 {
              a = a/2
            } else {
              a = (3 * a + 1) / 2
            }
            length++
          }
    
          if length > maxlength {
            maxa = a0
            maxlength = length
          }
        }
        log.Println(maxlength, maxa)
    
      }
    

Even more interestingly, this is even faster:

    
    
      > time ./collatz
      2013/12/01 13:59:06 329 837799
      ./collatz  0,27s user 0,00s system 99% cpu 0,276 total

------
lelf
[http://lelf.lu/files/fast-num-haskell-bench.html](http://lelf.lu/files/fast-num-haskell-bench.html)

Without start-up overhead. C versions via FFI.

[https://gist.github.com/llelf/7721211](https://gist.github.com/llelf/7721211)
(quick'n'shitty, sorry)

------
bjorg
Is it just me, or does nobody care that the running times are so low to begin
with? Wouldn't it be better if the algorithm ran for 10+ seconds at least?
Maybe closer to a minute. That would reduce variance due to operating system
interference, as well as hide the process startup cost, which shouldn't be
relevant here.

------
carterschonwald
OK, I'm going to cook up a variant that should match the C version in
dirty-dirty-style Haskell. Will post it after lunch. The list fusion stuff is
great for getting performance out of composing systems, but leaves those last
dangling constant factors vs. the ideal tight inner loop.

------
Nican
I just started learning Haskell. Can someone explain what (\a0 -> (collatzLen
a0, a0)) is doing?

~~~
wisty
> print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

I think the last bit of the line is:

    
    
        for a0 in range(1,max_a0):
            collatzLen(a0,a0)
    

a0 is basically the loop variable.

~~~
evincarofautumn
Function application has high precedence in Haskell, so it would be
(collatzLen(a0), a0).

------
craigyk
What is the preferred Python speed-up route these days? I've tried: numpy,
the C API, Cython, weave, numba, parakeet.

------
xanth
I think functional languages are always going to be slower than procedural
languages for most tasks, because they are an abstraction away from the bare
metal. However, the ease of maintaining functional code may outweigh the 19%
(and diminishing) speed advantage of procedural code.

~~~
toolslive
C is indeed close to the hardware. 1970s hardware, that is. That being said,
the rule of thumb for Haskell, OCaml, etc. is that in terms of performance
they are somewhere between 2 times slower and 2 times faster, depending on
details, strategy, etc. In this particular case, it seems the runs might be
too short to see a significant difference. By the way, try -O3 instead of
-O2; it makes a difference...

~~~
bitkrieg
Well, today's hardware is still all von Neumann, just like in the 70s.

