An example of fast numerical computation using Haskell (mit.edu)
111 points by lelf on Nov 30, 2013 | hide | past | favorite | 76 comments


I've already posted this on /r/haskell on reddit, but since the article turned up here as well, I might as well post it again. Since we're talking optimization, I had a quick look at the C code, found it to be a very naive implementation, and managed to cut the execution time quite a bit by employing a form of memoization: once you reach a number in the sequence that you have already seen, the length from there to 1 is the same as before.

So here is the code: http://lpaste.net/96397
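In case the paste link rots, here is a sketch of the memoization idea in Python rather than the original C (the function name is mine; the step rule, even -> a/2, odd -> (3a+1)/2, and the length convention follow the thread's code):

```python
def collatz_lengths(limit):
    """Find (longest chain length, its start) for starts in 1..limit.

    Memoization: cache[a] stores the chain length of every value already
    seen, so once a sequence drops into known territory we stop walking
    and backfill the lengths along the path we took to get there.
    """
    cache = {1: 0}
    best = (0, 1)
    for a0 in range(2, limit + 1):
        path = []
        a = a0
        while a not in cache:
            path.append(a)
            # Shortcut step: merge "3a+1 then halve" into one step.
            a = a // 2 if a % 2 == 0 else (3 * a + 1) // 2
        length = cache[a]
        for b in reversed(path):
            length += 1
            cache[b] = length
        best = max(best, (cache[a0], a0))
    return best

# collatz_lengths(1000000) reproduces the thread's (329, 837799)
```

Intermediate values above `limit` are still cached here (they occur in later chains too), which is what makes the second pass so much cheaper than the naive one.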

Here are my measurements:

    time ./original 1000000
    (329, 837799)

    real   0m0.183s
    user   0m0.180s
    sys    0m0.001s

    time ./new 1000000
    (329, 837799)
    
    real   0m0.017s
    user   0m0.015s
    sys    0m0.001s


The idea is to optimize while keeping the algorithm the same across the languages, at least if you want any meaningful comparison. You are comparing languages, not different algorithms. +1 for effort though.


It doesn't seem that meaningful to compare a totally naive C implementation to an optimised Haskell one either. (Well, other than to show how fast C is even without any effort to optimise the code). Saying you have to keep the algorithm the same seems like a rather arbitrary limitation on the comparisons, especially since some algorithms are much less of a natural fit in some languages (see: quicksort in Haskell vs C [1]).

I would suggest it's best to show (at least) four versions of the code to compare two languages: the naive implementation in both languages, and the really optimised version in both (plus intermediate levels of optimisation if you feel dedicated). That gives a picture of how fast the typical code one would write will be, as well as the potential for optimisation if you hit bottlenecks, and how much effort it takes to optimise the code.

[1] http://augustss.blogspot.co.uk/2007/08/quicksort-in-haskell-...


> totally naive C implementation to an optimised Haskell one either.

The Haskell version is totally naïve. That's the point.


The first version, which takes 5.9 seconds, is naive. The second one certainly isn't, and the third is based on cleaning up the second with the use of some library code. As somebody who isn't a Haskell programmer, it isn't clear to me at all that anybody would write the final version of the code if they weren't trying to optimise it. Maybe they would, but the article doesn't really argue that case, and it appears that stream-fusion isn't part of the language/standard libraries (you have to install it with cabal).

Once you start modifying code to make it run faster it's no longer a 'naive' implementation. No matter how nice or logical you think those transformations are.

I'm not saying you can't compare the stream-fusion version against other languages, just that it's highly disingenuous to optimise the Haskell code with some external libraries but then argue that optimisations of the C version are unfair.


If anyone is being disingenuous here, it's you.

The first version [1] is simple and slow. The second version [2] is complex and fast.

The third, fast version [3] is identical to the first, simple version, with list processing functions imported from the `stream-fusion` package. This is the entire point of this article.

The only transformation shown is purely mechanical (prepending symbol names with the qualified import prefix `S.`), and even this is not required (as `Prelude` could have been imported `hiding` the appropriate symbol names instead).

[1]: http://www.mit.edu/~mtikekar/posts/src/collatz.hs

[2]: http://www.mit.edu/~mtikekar/posts/src/collatz1.hs

[3]: http://www.mit.edu/~mtikekar/posts/src/collatz2.hs


Are there downsides to using the stream-fusion package? If not, why hasn't it been made the default so that the first naive version is fast?


As I understand it, stream fusion is still a subject of active research. A paper showing generalised stream fusion in Haskell was submitted to this year's ICFP.

There is also no benevolent dictator in charge of Haskell.

http://research.microsoft.com/en-us/um/people/simonpj/papers...

http://www.reddit.com/r/haskell/comments/1br0ls/haskell_beat...


Indeed. The generalized stream fusion work is about lifting pointwise operations to their SIMD vectorized analogues.

Likewise, stream fusion isn't always a win! Stream fusion and its siblings are great for pointwise operations, but aren't always a good idea for computations with heavy reuse, like nested convolutions or performant matrix multiply.


Complexity of concatMap operations. Nested things don't always fuse.

If you want to use fusion, the best place is the `vector` package.


I think the idea is that there are many more optimizations that could be performed, but this "naïve" version is still very idiomatic and high level.


Agreed. Optimizing the code without first fixing the asymptotic performance of the algorithm detracts significantly from the article. I've solved this problem before in Project Euler, in both Haskell and C++, and for my purposes as a beginning Haskell programmer, figuring out how to construct an asymptotically efficient algorithm for a Dynamic Programming problem in Haskell is what is interesting here.


I know what you mean, but it's early in the morning here and I feel like being a bit pedantic: as far as we can tell, neither of the algorithms actually terminates for arbitrary inputs; if we could show that, we'd have proven the Collatz conjecture. The notion of asymptotic performance therefore doesn't make a lot of sense, since we'd want to give an upper bound on how long the algorithm takes to terminate. But finding such an upper bound would amount to solving the problem itself, because the runtime depends on how long a given sequence is, and putting that into a closed form would trivially give us an O(1) algorithm.

What we can do is measure complexity empirically and extrapolate from there. I haven't done a lot of tests with this code, but to me it seemed to scale the same as the original while having a much lower constant.


The main takeaway here is that Haskell's stream fusion library is very nice. You can write code in the "naive" modular style that generates intermediate lists, and they get automagically fused away.

There's a little work on that in Lispland as well. Some discussion, a prototype, and a link to the older SERIES package: http://pvk.ca/Blog/Lisp/Pipes/
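For readers outside Haskell, a loose analogy (my own sketch, not from the article) is a Python generator pipeline: each stage is written modularly, yet no intermediate list is ever materialized. Fusion goes further by compiling the whole pipeline into a single loop at compile time.

```python
def collatz_chain(a):
    """Lazily yield the shortcut-step Collatz sequence from a down to
    (but not including) 1: even -> a // 2, odd -> (3a + 1) // 2."""
    while a != 1:
        yield a
        a = a // 2 if a % 2 == 0 else (3 * a + 1) // 2

def collatz_len(a):
    # Consumes the generator one element at a time; no list is built,
    # loosely like a fused takeWhile/length pipeline.
    return sum(1 for _ in collatz_chain(a))
```

Here `collatz_len` counts the steps taken to reach 1, matching the C code's loop counter.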


Yes, this is an example where purity is really helpful.


Here is an implementation in OCaml. Time taken on a 1.3 GHz MacBook Air:

    :tmp $ time ./collatz.native 1000000
    start: 837799 length 329

    real	0m0.782s
    user	0m0.776s
    sys	0m0.004s

    exception Error of string

    let (@@) f x  = f x
    let error fmt = Printf.kprintf (fun msg -> raise (Error msg)) fmt

    let max (a, x) (b, y) = if x > y then (a, x) else (b, y)

    let length start =
        let rec loop len a = 
            let next = if a mod 2 = 0 then a/2 else (3*a+1)/2 in 
            if a != 1 then
                loop (len+1) next
            else
                len
        in
            loop 0 start
                
    let collatz limit =
        let rec loop start longest =
            if start >= limit 
            then longest 
            else loop (start+1) (max longest (start, length start)) 
        in
            loop 1 (1,0)

    let report (start, len) = Printf.printf "start: %d length %d\n" start len

    let main () =
        let argv        = List.tl @@ Array.to_list Sys.argv in
        match argv with
        | []    -> report @@ collatz 10000
        | [n]   -> report @@ collatz @@ int_of_string n
        | _     -> error "usage: collatz [n]"

    let () = main ()


As a big OCaml fan, this is pretty cool. The speeds aren't that bad either. I hope OCaml gets more competitive in terms of running time in the future.


The code could probably be made faster by avoiding passing the (start, length) tuples around, as these are allocated on the heap. However, I found it a little more abstract this way.


    Cython	0.47s

...ignoring the cdefs, it's just as readable or more, at least for someone used to reading imperative code as opposed to functional code or mathematical equations. Plus, it takes you five minutes to get from a quick-and-dirty Python hack to Cython. And you can stay blissfully ignorant of what "stream fusion" is, focusing on the problem you're trying to solve and not on the computer science of the tools. Haskell really is far, far away from the "don't make me think" philosophy (or "don't make me think of anything besides the real problem I'm trying to solve"). No wonder Python and Cython are so loved in the scientific computing field :)


This article was less a practical look into how to write fast Haskell code and more a peel-back-the-curtain look. Practically, you'd just use the vector library [0], which has massive amounts of stream fusion built in and exposes only very simple immutable and mutable APIs.

[0] http://hackage.haskell.org/package/vector


As you acknowledge yourself, whether Haskell makes you think or not depends on what you're trained to read.


Yes, you're right that Haskell is probably easier to read for someone with a mathematical background. But I'm referring to the cognitive overhead incurred when you try to optimize it: thinking about list creation costs, then about stream fusion to avoid them. Most physicists or mathematicians will cringe at the thought of having to think about this "computer-sciency optimization stuff". A crude tool such as Cython or C or even Fortran, by contrast, lets one simply skip thinking about these things and still have fast enough code.

Haskell is great if you want to expand your CS knowledge, but not if your math/physics/engineering problem is already so complicated that it uses up all your available brain cycles and working memory, and you just can't squeeze in thinking about language-specific optimizations too. I guess this is why it's only used by companies like Galois and such that are already neck deep in advanced CS problems and used to thinking about them all the time.


There is a saying: first make it work, then make it fast. The problem with doing the second part in imperative languages is that you need to pull all your working code apart and rewrite it. Haskell separates the "make it work" from the "make it fast".


Except that no one actually thinks about that stuff when writing actual vector code. Trust me, I have written several real time 3D reconstruction algorithms (such as SLAM) in Haskell and was not thinking about stream fusion at all.


> But I'm referring to the cognitive overload incurred when you try to optimize it, thinking about list creation costs, then stream fusion to avoid it

That's not true. This article is about a library that works as a drop-in replacement for the default list-processing functions in the Prelude. Eventually, these functions may even be put into the Prelude itself, which would let naive users get all the performance gains from stream fusion without ever knowing it's happening. All of the "cognitive overload" is taken care of by the researchers who developed this technique and the library authors who implement it.


> Yes, you're right that Haskell is probably easier to read for someone with the mathematical background.

I am quite taken by Haskell, and I have the whopping mathematical background of two introductory courses from my institute of mathematics.

> But I'm referring to the cognitive overload incurred when you try to optimize it, thinking about list creation costs, then stream fusion to avoid it

The author didn't need to know what stream fusion was; all they did was import a library. It might as well have been called MagicallyOptimizingPrelude and it would have had the same effect.


I take issue with this benchmark: comparing it to Haskell is not illuminating, since the Cython cdef function is straight C being called as a foreign function. If you do the same in Haskell via the C FFI, it ends up roughly the same as the C code (not surprisingly).

Having written quite a bit of Cython, I refuse to touch it anymore given how wildly undefined its behavior can be when intermingling Python and C semantics.


I agree. I think the Cython version is eminently readable.


I reimplemented the function in Julia and it took 16.51 seconds to run, compared to 0.19 seconds for the C version.

Here is the code, perhaps someone else would know how to optimize it.

CODE:

    function main(max_a0)
       longest = max_len = len = a = 0
       for a0 in 1:max_a0
          a = a0      
          len = 0
          while a != 1
             len += 1
             a = ((a%2==0) ? a : 3*a+1)/2
          end
          if len > max_len
             max_len = len
             longest = a0
          end
       end
       println("($max_len, $longest)")
    end

    @time main(1000000)

CMD:

    $ julia collatz.jl 
    (329, 837799)
    elapsed time: 16.436623922 seconds (4206873264 bytes allocated)

    $ julia --version
    julia version 0.3.0-prerelease+143


Wow, I was able to get it down to 0.88 seconds by specifying the type of a variable, like this: a::Int64. That's a factor of 18.6!

Here is the new code:

    function main(max_a0)
       longest = max_len = len = a = 0
       for a0 in 1:max_a0
          a = a0      
          len = 0
          while a != 1
             len += 1
             a::Int64 = ((a%2==0) ? a : 3*a+1)/2
          end
          if len > max_len
             max_len = len
             longest = a0
          end
       end
       println("($max_len, $longest)")
    end

    @time main(1000000)
CMD:

    $ julia collatz.jl 
    (329, 837799)
    elapsed time: 0.882312881 seconds (1377276 bytes allocated)


You can avoid allocating so much memory by changing

  a::Int64 = ((a%2==0) ? a : 3*a+1)/2
to

  a = ((a%2 == 0) ? a : 3a+1) >> 1
or

  a = div((a%2 == 0) ? a : 3a+1, 2)
For me this gives:

  (329, 837799)
  elapsed time: 0.449473338 seconds (544 bytes allocated)
For reference, the C version gives:

  ./a.out 1000000  0.40s user 0.00s system 99% cpu 0.407 total


What about: a = ((a & 0x01) ? (a*3)+1 : a) >> 1;

Anyway, when you think about it, you can skip everything between 0 and max_a0/2, because for any value there, there is one with a longer chain (its double). That's a factor of 2 in all languages.
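The halving observation can be sketched in Python (a sketch with a name of my own, not from the thread): for any start n <= limit/2, the start 2n <= limit reaches n after one even step, so its chain is exactly one step longer, and the maximum must therefore lie in the upper half of the range.

```python
def longest_chain(limit):
    """Naive shortcut-step Collatz search over only the upper half:
    starts <= limit // 2 are dominated by their doubles, which have
    chains exactly one step longer."""
    best = (0, 1)
    for a0 in range(limit // 2 + 1, limit + 1):
        a, length = a0, 0
        while a != 1:
            a = a // 2 if a % 2 == 0 else (3 * a + 1) // 2
            length += 1
        best = max(best, (length, a0))
    return best
```

This returns the same (length, start) pair as a full scan, while doing half the work.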


Of course halving the range checked would also be an improvement. My own improvement was meant only to reduce the memory usage of antimora's code.


Indeed, the problem here is the type instability in the inner loop. There are a few optimization techniques that we intend to implement to deal with that but as you noticed, we're not quite there yet.


This isn't really a missing optimization, but an understandable mistranslation of the C code: the / operator with integer operands doesn't mean the same thing in Julia as it does in C. In C it means div, whereas in Julia it does floating-point division.


It's both. Declaring the `::Int64` doesn't magically turn it into integer division, but it does insert a convert(Int64, ...), hence all the speedup here is obtained through type stability, even though the algorithm is still different.


Right, fair enough.


A naive implementation in Nimrod:

     proc main(max_a0: int): int =
       var a, longest, len, max_len : int
     
       for a0 in countup(1, max_a0):
         a = a0
         len = 0
     
         while a != 1:
           len += 1
           if (a mod 2 != 0): a = (3*a + 1)
           a = a div 2
     
         if len > max_len:
           max_len = len
           longest = a0
     
       return longest
     
     # Main program starts here
     echo(main(1000000))

Takes 0.76s (the C program in TFA takes 0.58s).


Here we go, Data.Vector and the same somewhat flawed comparison as in the post:

  next a = Just . join (,) $ (if even a then a else 3*a+1) `div` 2

  len = (+1) . V.length . V.takeWhile (/=1) . V.unfoldr next

  result = V.maximum . V.map (\x -> (len x, x)) . V.enumFromTo 1

  main = do [a0] <- getArgs
            let max_a0 = read a0 :: Word32
            print . result $ max_a0

Results: Haskell 5.0s, C 4.8s. Good enough, I guess?

PS: and this is how you do a proper benchmarking:

  import Criterion.Main
  main = defaultMain [
          bench "data.vector/1M" $ whnf result 10000000
         ]

With results like http://lelf.lu/tmp/criterion.html


  resultQ :: W -> (W,W)
  resultQ n = foldAllS max (0,0) $ runIdentity $ R.computeUnboxedP vec
    where
      fun (Z :. i) = len &&& id $ fromIntegral i
      vec = R.fromFunction (Z :. fromIntegral n) fun :: Array D DIM1 (W,W)
Not as straightforward, but http://lelf.lu/files/fast-num-haskell-bench-N2.html (now you know the number of cores on my MacBook, I guess).

Not bad for 3 lines of code?


Why are the stream fusion functions in a separate module? Why were the functions in prelude not replaced by these more efficient ones?


Partly for historical reasons and partly because Prelude is really intended for beginners.


Also for definitional reasons!


Could you elaborate?


He means that there is a generally understood semantic meaning for the functions as they exist in the Haskell Prelude. Replacing them under the hood with something potentially different makes the "meaning" of the basic functions not quite the same. Haskell has a history rooted in this kind of thing so it continues in the community now.


But they are not different. The functions in stream-fusion are a drop-in replacement for the ones in Data.List. To quote the package description: "To use, simply import Data.List.Stream in place of Data.List, and hide list functions from the Prelude."

So to put it simply: you have two functions. They are semantically equivalent, but one is faster than the other. Why not use the faster one?


Yup! Thanks


I was bored and curious to know how the Collatz conjecture works.

Then, I implemented it in Lua and JS and compared their performance to C and Haskell. It's not meant to be a (good) benchmark. Here you go: https://gist.github.com/killercup/7718804

tl;dr

    Implementation | Time 
    -------------- | -----
    clang 3.3      | 0.40s
    luajit 2.0.2   | 0.86s
    GHC 7.6.3      | 0.49s
    Node 0.10.22   | 2.43s


My 2 cents to defend Node.js:

   329 837799
   naive 3.06 sec
   329 837799
   decompozed 3.03 sec
   329 837799
   memoized decompozed 0.52 sec

The program: https://gist.github.com/xdenser/7722083


Presumably the Lua, C, and Haskell programs would also be 6× faster if you used the memoised algorithm, so that's a pretty meaningless comparison.


OK, this intArithm variant (without memoization) for Node.js is 6 times faster (0.54 sec) than the naive one:

        function intArithm(){
        var
            max_a0 = +process.argv[2],
            longest = 0,
            max_len = 0, a0,len, a,z;

        for(a0=1;a0<=max_a0;++a0){
            a = a0;
            len = 0;
            while( a != 1){
                len++;
                z = a >>> 1;
                a = (!(a&1))?z:(a+z+1);
            }
            if(len>max_len){
                max_len = len;
                longest = a0;
            }
        }
        console.log(max_len,longest);
        }
The problem was the division-by-2 operation, which is not integer division in JS, so replacing it with the shift >>> does the trick. The formula optimization gives an additional 200 ms.

And I doubt a memoized C or Lua variant would be 6 times faster than the non-memoized one, as the speedup is not that linear.


The Python equivalent:

  real	0m21.891s

on a machine where the C variant gives:

  real	0m0.570s

But I learned about Cython today, so great article.

It's interesting that the Python version takes so much longer.

  import sys

  check=int(sys.argv[1])
  lnum=1
  lcount=1

  for x in range(1,check):
    len=1
    a=x
    while (a>1):
      len+=1
      if a%2==0:
        a=a/2
      else:
        a=(a*3+1)/2
    if len > lcount:
      lcount=len
      lnum=x

  print lcount,lnum


I don't see how the C and Cython versions are different. It's almost word-for-word C.


It is interesting to point out that using -O3 drops the C (and Cython) execution time by roughly 45%.


I made a quick Clojure version [1]. It's really slow and probably terrible, but I wonder if someone more knowledgeable in Clojure could look at it and optimize it the way it happened with the Haskell version.

[1]: http://pastebin.com/PH6DYbmr


Since you know the division in collatz-next is always exact, you can change / to quot or a bit-shift right. That should improve it somewhat. You should also type-hint the arguments to your functions with ^long. Beyond that, I would probably compile the code and then use a decompiler to check for boxing. But I can't say I've done much math-intensive programming in Clojure.


Nice.

4.5 -> 2.7 seconds on my machine when using 'quot' instead of '/'.

2.4 seconds when hinting ^long in argument of collatz-length.

1.6 when also hinting ^long in collatz-next. Interestingly, when I hint only in collatz-next it takes 1.5.


Thanks! I was completely unaware that such type-hinting existed.

I was also unsure whether loop/recur was a good looping paradigm for this kind of problem (maybe for with a variable would be better?), but I'd say at 1/3 the speed of C it's pretty good for Clojure :)


Timings on my machine (2.67ghz Xeon W3520):

    $time ./c_compiled_with_O2 1000000
    real	0m0.362s
    user	0m0.360s

    $time ./c_compiled_with_O3 1000000
    real	0m0.189s
    user	0m0.184s
I also tried dropping all the cdefs from the Cython code and running the Python version in Parakeet (http://www.parakeetpython.com) and it took 0.57s (whereas CPython without the @jit decorator took 19.57s).


Interestingly, this naive Go version is faster than the naive C version:

  package main

  import (
    "log"
  )

  func main() {
    max := 1000000

    var maxa int
    var maxlength int
    for a0 := 0; a0 < max; a0++ {

      var length int
      a := a0

      for a > 1 {
        if a%2 == 0 {
          a = a/2
        } else {
          a = (3 * a + 1) / 2
        }
        length++
      }

      if length > maxlength {
        maxa = a0
        maxlength = length
      }
    }
    log.Println(maxlength, maxa)

  }
Compiled with go 1.1:

  > go version
  go version go1.1.2 linux/amd64
Result:

  > time ./collatz
  2013/12/01 13:51:31 329 837799
  ./collatz  0,34s user 0,00s system 99% cpu 0,344 total


Oh, and I just tested this on play.golang.org, and I got wrong results because a was too small, so we need to manually make it a uint64:

  package main

  import (
    "log"
  )

  func main() {
    max := 1000000

    var maxa int
    var maxlength int
    for a0 := 0; a0 < max; a0++ {

      var length int
      a := uint64(a0)

      for a > 1 {
        if a%2 == 0 {
          a = a/2
        } else {
          a = (3 * a + 1) / 2
        }
        length++
      }

      if length > maxlength {
        maxa = a0
        maxlength = length
      }
    }
    log.Println(maxlength, maxa)

  }
Interestingly, this is even faster:

  > time ./collatz
  2013/12/01 13:59:06 329 837799
  ./collatz  0,27s user 0,00s system 99% cpu 0,276 total


Is it just me, or does nobody care that the running times are so low to begin with? Wouldn't it be better if the algorithm ran for at least 10 seconds, maybe closer to a minute? That would reduce variance due to operating-system interference, as well as hide the process startup cost, which shouldn't be relevant here.


http://lelf.lu/files/fast-num-haskell-bench.html

Without start-up overhead. C versions via FFI.

https://gist.github.com/llelf/7721211 (quick'n'shitty, sorry)


OK. I'm going to cook up a variant that should match the C version, in dirty, dirty-style Haskell. Will post it after lunch. The list fusion stuff is great for getting performance out of composed systems, but leaves those last dangling constant factors versus the ideal tight inner loop.


I just started learning Haskell. Can someone explain what (\a0 -> (collatzLen a0, a0)) is doing?


That is a function that takes a starting value and returns a pair with the length of the sequence as the first element and the starting value as the second. It is mapped over the list of integers from 1 to max_a0 to find the length of the sequence starting from each integer.

maximum finds the largest element of a list, and pairs compare on their first element first, so the pair with the longest sequence length as its first element and the starting value that produced it as its second element will be returned by `maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]`.

This function could also be expressed as `collatzLen >>= (,)`, if you were into that sort of thing.
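For intuition, here is the same decorate-then-maximum pattern written in Python (my own sketch; Python tuples also compare lexicographically, first elements first):

```python
def collatz_len(a):
    """Shortcut-step chain length: steps until a reaches 1."""
    n = 0
    while a != 1:
        a = a // 2 if a % 2 == 0 else (3 * a + 1) // 2
        n += 1
    return n

def longest(max_a0):
    # Mirrors `maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]`:
    # pair each start with its length, take the lexicographic maximum.
    return max((collatz_len(a0), a0) for a0 in range(1, max_a0 + 1))
```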


> print $ maximum $ map (\a0 -> (collatzLen a0, a0)) [1..max_a0]

I think the last bit of the line is:

    for a0 in range(1,max_a0):
        collatzLen(a0,a0)
a0 is basically the loop variable.


Function application has high precedence in Haskell, so it would be (collatzLen(a0), a0).


chas explained it very well already. I'd just like to add that this is called a "lambda expression" or an anonymous function, in case you want to look it up. There's more on it here: http://www.haskell.org/haskellwiki/Anonymous_function


What is the preferred Python speed-up route these days? I've tried: NumPy, the C API, Cython, weave, Numba, Parakeet.


I think functional languages are always going to be slower than procedural languages for most tasks, because they are an abstraction away from the bare metal. However, the ease of maintaining functional code may outweigh the 19% (and diminishing) speed advantage of procedural code.


There's nothing requiring functional languages to abstract away from the bare metal. For example, read up on Typed Assembly Language (TAL) -- it's a typed, functional implementation of x86 assembly.

I think the main reason functional languages are currently slower, in general, than procedural languages is that they haven't been popular long enough to get all of the necessary optimizations implemented in compilers. Think about it: millions of person-hours have been invested in optimizations for C and Fortran compilers over the past few decades. I expect the performance gap will shrink steadily as time goes on, and as you said, at some point a cost/benefit threshold is reached where it makes more sense to switch to a functional language even if it's a little slower.


C is indeed close to the hardware. 1970s hardware, that is. That said, the rule of thumb for Haskell, OCaml, etc. is that in terms of performance they are somewhere between 2 times slower and 2 times faster, depending on details, strategy, and so on. In this particular case, it seems the runs might be too short to see a significant difference. By the way, try -O3 instead of -O2; it makes a difference.


Well, today's hardware is still all von Neumann, just like in the 70s.


> because they are an abstraction away from the bare metal.

This really is a myth; there's been plenty of work showing you can compile the lambda calculus to abstract machines that map nearly one-to-one onto assembly. High-level functional languages make the same performance compromises that high-level imperative languages make (garbage collection, runtime dispatch, etc.), but there's nothing inherently slower about compiling functional languages.


Unless the bare metal changes to something that matches the functional approach better. Or simply changes, and higher-level concepts protect you because you are not tied to implementation details.



