
Recursive fibonacci benchmark using top languages on GitHub - anonfunction
https://github.com/drujensen/fib
======
drujensen
ok, owner of the repo here. This project was purely to show the macro
differences between interpreted Ruby and compiled Crystal to beginner Ruby
devs at a meetup.

I decided to add the top languages on GitHub to help give some idea of how
Crystal performed against them.

I'm happy to see so much discussion about it and was taken by surprise when my
inbox was full this morning. Thanks @anonfunction. ;-)

The breaking benchmark examples were just that: examples of how to break the
benchmark. I didn't expect to get a memoized version of every language, and I
really don't think comparing them in a performance benchmark makes much
sense. Let me know if I'm wrong about that.

I am fine adding all of your change requests and will try to keep the
benchmarks up to date.

~~~
petre
I used to benchmark using fibonacci, but the recursive method is just awful
because it does a lot of function calls and you're essentially benchmarking
that. Then I switched to finding primes using the Sieve of Sundaram. It uses
arrays, hashes/maps/dicts and two loops. Also wastes a lot of memory if you
don't split your search domain into ranges. The surprise was that Go and D (in
that order) turned out to be faster than Rust mainly due to Rust's HashMap
SipHash algorithm. I gave up trying to use other hash libraries (SeaHash
specifically) that are not part of the standard lib, because it was quite
frustrating compared to D.
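
For reference, the sieve itself is only a few lines; a minimal Python sketch
(my illustration here, not the multi-language benchmark code):

```python
def sundaram_primes(limit):
    """Return the primes up to `limit` using the Sieve of Sundaram."""
    if limit < 2:
        return []
    k = (limit - 1) // 2
    marked = [False] * (k + 1)
    for i in range(1, k + 1):
        j = i
        # Mark every n = i + j + 2ij; for those n, 2n + 1 is composite.
        while i + j + 2 * i * j <= k:
            marked[i + j + 2 * i * j] = True
            j += 1
    # Each unmarked n in 1..k corresponds to the odd prime 2n + 1.
    return [2] + [2 * n + 1 for n in range(1, k + 1) if not marked[n]]

print(sundaram_primes(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```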

~~~
igouy
> gave up trying to use other hash libraries

 _fwiw_ a Rust k-nucleotide program using indexmap:

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/knucleotide-rust-4.html)

------
glandium
So... before looking at the code, I compared the elapsed time between the
fib.rs program compiled with `rustc -O` and `rustc -O -C lto`. This yielded,
interestingly, a ~500ms difference: ~6.45s vs ~6.95s (with LTO winning).

Then I looked at the code, and it began to make no sense. Then I looked at the
assembly, and it made even less sense: the main function, fib::fib, which is
where the vast majority of the time is spent, is identical, except for
addresses.

I was starting to think, well, this might be related to instruction-cache
lines... and it looks like it's not that:

    
    
      $ perf stat ./fib-lto
      2971215073
    
       Performance counter stats for './fib-lto':
    
             6474.410894      task-clock (msec)         #    1.000 CPUs utilized          
                      17      context-switches          #    0.003 K/sec                  
                       0      cpu-migrations            #    0.000 K/sec                  
                     112      page-faults               #    0.017 K/sec                  
          22,607,598,238      cycles                    #    3.492 GHz                    
          55,325,512,417      instructions              #    2.45  insn per cycle         
          13,021,149,180      branches                  # 2011.171 M/sec                  
              76,737,179      branch-misses             #    0.59% of all branches        
    
             6.474991613 seconds time elapsed
    
             6.474816000 seconds user
             0.000000000 seconds sys
    
    
      $ perf stat ./fib
      2971215073
    
       Performance counter stats for './fib':
    
             6956.534790      task-clock (msec)         #    1.000 CPUs utilized          
                      11      context-switches          #    0.002 K/sec                  
                       0      cpu-migrations            #    0.000 K/sec                  
                     113      page-faults               #    0.016 K/sec                  
          24,290,924,647      cycles                    #    3.492 GHz                    
          55,325,864,841      instructions              #    2.28  insn per cycle         
          13,021,213,404      branches                  # 1871.796 M/sec                  
              84,392,454      branch-misses             #    0.65% of all branches        
    
             6.957260487 seconds time elapsed
    
             6.956974000 seconds user
             0.000000000 seconds sys
    
    

I didn't know an address difference of 0x30 could influence the number of
branch misses so much.

~~~
glandium
Interestingly, the C and C++ versions have different performance for the same
reason: they produce the exact same machine code, at different addresses. Thus
one runs faster than the other. _Edit: actually, there isn't a difference in
branch-misses for those... only in cycles, which is even more intriguing._

Another interesting fact: valgrind's branch simulator doesn't show a
difference in branch mispredictions between the two (it also shows a
misprediction rate much larger than reality; it's likely based on an old model.
_Edit: the valgrind manual says: Cachegrind simulates branch predictors intended
to be typical of mainstream desktop/server processors of around 2004._)

 _More edit_ : I hacked a linker script to place the fib function from the C++
implementation at a fixed address, and tried different addresses, with
interesting results:

    
    
      0x10000: 5.117084095 seconds time elapsed, 17,866,364,962 cycles
      0x10010: 5.206242916 seconds time elapsed, 18,090,712,907 cycles
      0x10020: 6.096635146 seconds time elapsed, 21,285,484,583 cycles
      0x10030: 4.955020420 seconds time elapsed, 17,297,162,836 cycles
      0x10040: 5.146043048 seconds time elapsed, 17,954,722,919 cycles
      0x10050: 5.252477508 seconds time elapsed, 18,335,804,193 cycles
      0x10060: 6.100806292 seconds time elapsed, 21,300,089,284 cycles
      0x10070: 4.936397948 seconds time elapsed, 17,216,051,020 cycles
    

_Even more edit_ : I vaguely remember there was someone doing some analysis
(with performance counters) of some similar performance difference depending
where the function was located, that was on HN a few months ago, but I can't
find it anymore.

~~~
jcranmer
An Intel engineer gave a presentation at the 2016 LLVM Developers' Meeting
that gives some insight: [http://llvm.org/devmtg/2016-11/Slides/Ansari-Code-Alignment.pdf](http://llvm.org/devmtg/2016-11/Slides/Ansari-Code-Alignment.pdf) (for the slides).

Essentially, some of the loop heuristics work on a 32-byte window, not a
16-byte window. An interesting statistic would be to find where all the loops
are in the code as relative offsets, and find how these shift based on
starting alignment.

~~~
glandium
Thanks. This looks familiar, but I thought I had seen something similar in a
blog form. _Edit_ : Found it!
[https://dendibakh.github.io/blog/2018/01/18/Code_alignment_i...](https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues)
[https://news.ycombinator.com/item?id=16198171](https://news.ycombinator.com/item?id=16198171)

FWIW, this is what the function looks like at 0x10070:

    
    
      0000000000010070 <_Z3fibl>:
       10070:       48 83 ff 01             cmp    $0x1,%rdi
       10074:       7e 32                   jle    100a8 <_Z3fibl+0x38>
       10076:       55                      push   %rbp
       10077:       31 ed                   xor    %ebp,%ebp
       10079:       53                      push   %rbx
       1007a:       48 89 fb                mov    %rdi,%rbx
       1007d:       48 83 ec 08             sub    $0x8,%rsp
       10081:       48 8d 7b ff             lea    -0x1(%rbx),%rdi
       10085:       48 83 eb 02             sub    $0x2,%rbx
       10089:       e8 e2 ff ff ff          callq  10070 <_Z3fibl>
       1008e:       48 01 c5                add    %rax,%rbp
       10091:       48 83 fb 01             cmp    $0x1,%rbx
       10095:       7f ea                   jg     10081 <_Z3fibl+0x11>
       10097:       48 83 c4 08             add    $0x8,%rsp
       1009b:       48 8d 45 01             lea    0x1(%rbp),%rax
       1009f:       5b                      pop    %rbx
       100a0:       5d                      pop    %rbp
       100a1:       c3                      retq   
       100a2:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
       100a8:       b8 01 00 00 00          mov    $0x1,%eax
       100ad:       c3                      retq
    

Edit: It's interesting to note that there's only one recursive call to the
function, not two (you get two calls if you compile with -O1). It's also
interesting to note that -Os generates an even smaller version with only one
recursion, but that ends up slower (~6.5s).

~~~
BeeOnRope
The compiler has determined that the second call `fib(n - 2)` is "almost"
tail-recursive (there is probably a better term for this), and so was able to
convert that call into the loop you see at 10081 to 10095. This loop
repeatedly calls the `fib(n - 1)` part.

In other words, the function was transformed from:

    
    
      uint64_t fib(uint64_t n) {
        if (n <= 1) return 1;
        return fib(n - 1) + fib(n - 2);
      }
    

into the equivalent of:

    
    
      uint64_t fib2(uint64_t n) {
        if (n <= 1) return 1;
        uint64_t sum = 1;
        for (; n > 1; n -= 2) {
          sum += fib2(n - 1);
        }
        return sum;
      }
    

Unfortunately, the result isn't too great as the initial part of the function
is pretty heavy, and this bottlenecks computation. It would be better to
inline some copies of fib into itself which might even allow some small amount
of CSE, but importantly would reduce the number of calls.

The alignment you show is "special" (fast) because it's the point at which the
top of the main loop (10081) is near the start of a 32-byte region, which
allows multiple uops to be efficiently dispatched after jumping there.

------
rbjorklin
The owner of this repo has no idea it’s trending on HN. This issue gave me a
good laugh:
[https://github.com/drujensen/fib/issues/51](https://github.com/drujensen/fib/issues/51)

~~~
avip
On the flip side, HN just won itself a new user.

~~~
drujensen
lol, yes it did.

~~~
vilos
Your project is informative and awesome, I hope you don't mind the attention.

------
gambler
Sorry, but this is _not_ how you do numeric benchmarks. Most real programs
that deal with numbers also manipulate data structures and call many standard
library methods. If those things are slow, your app is slow. This benchmark
doesn't demonstrate anything that would translate into real-life performance.

~~~
alangpierce
It's definitely not a realistic measure of overall language performance, but I
think it certainly has value if taken with the right caveats. It's probably a
reasonably good measure of function call overhead in different languages, and
I think it's good for getting a better intuition about the order of magnitude
difference to expect from, say, C vs JS (~3x slower than C in simple cases) vs
Python (~100x slower than C).

~~~
jcranmer
> It's probably a reasonably good measure of function call overhead in
> different languages

It's actually not. Fibonacci has a near-tail call in it that can convert some
of the recursion into a loop. Furthermore, the runtime is dominated by useless
recomputation that can be handled by memoization. So the distinction between
languages is going to be dominated by their ability to do some moderately
complex optimizations rather than by any intrinsic performance characteristics
of their implementation.

Moreover, because the code is so small, there are likely to be major side
effects from effectively random differences -- consider the code-placement
effects that cause a spurious difference of about 20% between the C and C++
kernels. The JS version is going to get hit with a deoptimization at the
very end (it might not impact performance) because the final result does not
fit in an int32_t, and suddenly fib is no longer a well-typed function.

Microbenchmarking is _hard_, and it is all too easy for the microbenchmark to
cease measuring the things that you want to measure.

~~~
Jtsummers
Fibonacci (as implemented here) shouldn't be aided by any TCO. The execution
trace will be something like:

    
    
      F(N) -- F(N-2) ...
       |
      F(N-1) -- F(N-3) ...
       |
      F(N-2) -- F(N-4) ...
    

But each has to return to the original caller because a summation still has to
happen. The actual tail call is to the +/2 function.

F(N) can't complete until F(N-1) and F(N-2) have completed so it can sum up
the values. If you pass the earlier computation, F(N-1), down to F(N-2) as a
second parameter, you could get a tail call on the last part.

    
    
      fib(0, Val) -> Val + 1;
      fib(1, Val) -> Val + 1;
      fib(N, Val) ->
        T = fib(N-1,Val),
        fib(N-2,T).
    

(I think I wrote that correctly, can't test here.)

But that's a non-obvious transformation. I really would like to see the
compiler that recognized that this was a valid (computational) equivalent of
the original.

~~~
jcranmer
But it is aided by tail-call optimization. Suppose you have a function as
follows:

    
    
       int foo(int x) {
         if (test(x)) return C;
         return evaluate(x) + foo(adjust(x));
       }
    

I can turn that function into:

    
    
       int foo(int x) {
         int accum = C;
         for ( ; !test(x); x = adjust(x)) {
           accum += evaluate(x);
         }
         return accum;
       }
    

In the case of fibonacci, that equates to:

    
    
       int fib(int x) {
         int accum = 1;
         for (; x > 1; x -= 2) {
           accum += fib(x - 1);
         }
         return accum;
       }
    

We still turned the recursive call (or one of them, at least), into a while
loop.

And this idiom is recognized by both GCC and LLVM. In fact, LLVM has a test
that specifically makes sure fib is transformed this way:
[https://github.com/llvm-mirror/llvm/blob/30fa583f8430bfc7935...](https://github.com/llvm-mirror/llvm/blob/30fa583f8430bfc7935bf66b1aab50cfefd6cc11/test/Transforms/TailCallElim/accum_recursion.ll)

------
feintruled
I was intrigued to learn more about the winning language Nim to see how it
beats C/C++, so I did a web search and was confounded to see Nim compiles to
C/C++! What gives? Starting to doubt the methodology somewhat, though it was
an interesting read.

~~~
fabriceleal
Probably more about the difference between "echo" and "std::cout << ... <<
std::endl;" than the recursive function call.

~~~
Sean1708
I would be _very very_ surprised if printing "2971215073" accounted for more
than 1 second of runtime.

~~~
gambler
Well, I know that printing stuff to console can be ridiculously slow, so this
doesn't surprise me much.

~~~
Sean1708
An entire second to print 10 characters?! How does a terminal that slow even
exist?

------
rothron
Don't see the point really.

Languages without tail recursion will perform worse. Memoization is borderline
cheating, because it's a different implementation.

Fib is the poster boy for tail recursion but the reason for that is that
recursion to implement fib is simply a bad choice. It's cute but that's about
it. If the point is to measure function call overhead, then measure _that_?

~~~
rothron
Just to underline the futility of this, if you are clever you can let the
compiler do it.

[https://everything2.com/title/C%252B%252B%253A+computing+Fib...](https://everything2.com/title/C%252B%252B%253A+computing+Fibonacci+numbers+at+compile+time)

~~~
eesmith
The benchmark includes a C++ constexpr version, at
[https://github.com/drujensen/fib/blob/master/fib-
constexpr.c...](https://github.com/drujensen/fib/blob/master/fib-
constexpr.cpp) , with timing numbers under the section "Optimized code that
breaks the benchmark" and the comment "all benchmarks will have some caveat."

------
leni536
It would be interesting to include skip[1] as it has language level
memoization. There was a recent HN discussion about the language [2]. I do not
think that the naive recursive Fibonacci is a useful benchmark in any way
though, it's a way too pessimized implementation.

[1] [http://www.skiplang.com/](http://www.skiplang.com/) [2]
[https://news.ycombinator.com/item?id=18077612](https://news.ycombinator.com/item?id=18077612)

~~~
gnulinux
Why do we need language-level memoization again? In Python, memoizing a
function is just one line, `@lru_cache()`. I can't see why this needs an
improvement by adding yet another thing to the core language.
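
For reference, that one line in context (standard-library `functools`; using
the repo's fib(0) == fib(1) == 1 convention):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache the result for every distinct argument
def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

print(fib(46))  # 2971215073, matching the benchmark's output, near-instantly
```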

~~~
thaunatos
Just adding an @lru_cache() to a Python function isn't guaranteed to be
correct; for example, if you're reading from an API, the response could change
between calls.

Skip tracks side effects, and will either (a) memoize automatically a pure
function or (b) recognize that a function is impure and avoid the memoization.
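
A contrived sketch of that hazard in Python (hypothetical `fetch`, standing in
for a real API call):

```python
from functools import lru_cache

responses = iter(["v1", "v2"])  # the "API" answers differently over time

@lru_cache(maxsize=None)
def fetch(url):
    # Impure: the result depends on external state, not just `url`.
    return next(responses)

print(fetch("https://example.com/config"))  # "v1"
print(fetch("https://example.com/config"))  # "v1" again -- the stale cached value
```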

------
freecodyx
During my last technical interview, they asked me to write a Fibonacci
implementation in Go. I wrote the code, then they asked me to test it for
something like fib(180). I was surprised how slow it was, then I was asked to
come up with an optimisation to make it fast. I came up with exactly the same
fib-mem.go provided in this benchmark. I was hired!

~~~
bufferoverflow
fib(180) without caching would take billions of years to complete, so "slow"
is an understatement.
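
A rough sanity check on that claim (a sketch; under the repo's
fib(0) == fib(1) == 1 convention, the naive call tree has 2*fib(n) - 1 nodes):

```python
def fib(n):
    # Fast iterative fib, only used here to size the naive call tree.
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

calls = 2 * fib(180) - 1          # nodes in the naive recursion tree
seconds = calls / 1e9             # generously assume 1e9 calls per second
years = seconds / (3600 * 24 * 365)
print(f"~{calls:.1e} calls, ~{years:.1e} years")
```

That lands on the order of 10^21 years, so "billions of years" is if anything
generous.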

~~~
bjoli
Or you write an iterative version, which will not only be the simplest
solution, it will be fast enough to compute fib(10000). There are constant-time
ones IIRC, but if someone comes up with such a solution in an interview
without knowing it from before, they are probably overqualified for any job
that uses the Fibonacci sequence as an interview question.
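
The iterative version in question is only a few lines; a Python sketch under
the repo's fib(0) == fib(1) == 1 convention:

```python
def fib_iter(n):
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_iter(46))               # 2971215073, the benchmark's answer
print(len(str(fib_iter(10000))))  # fib(10000) has a couple of thousand digits
```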

------
Gondolin
Any matrix implementation (with fast exponentiation) is going to be much
faster, even in a "slow" language. For instance, in Ruby, `Matrix[[1,1],[1,0]]
** 46` is instantaneous (real time: 0.000131s).

We have to go up to the ten millionth term to get something that is not
instantaneous (0.486s), for a result which has 2089876 digits.
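
For readers who don't know Ruby's Matrix class, the same idea as a Python
sketch: raising [[1,1],[1,0]] to the nth power by repeated squaring takes
O(log n) matrix multiplies, and a corner of the result is F(n).

```python
def mat_mul(A, B):
    """2x2 integer matrix product."""
    return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
            [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

def fib_matrix(n):
    """[[1,1],[1,0]]**n == [[F(n+1), F(n)], [F(n), F(n-1)]]; return F(n)."""
    result = [[1, 0], [0, 1]]  # identity
    base = [[1, 1], [1, 0]]
    while n:                   # exponentiation by squaring
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result[0][1]

print(fib_matrix(47))  # 2971215073 -- the repo's fib(46), since its series is shifted by one
```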

~~~
uryga
Are you sure `Matrix[[1,1],[1,0]] ** 46` isn't optimized into a constant
value by the bytecode compiler? I don't know Ruby well, but it seems like a
possibility.

------
sebazzz
Nice to see the different implementations side-by-side. I do believe however
that this is mainly a benchmark of startup time and initial JIT of the
CRT/runtime/virtual machine/execution environment.

If you remove that time from the equation, I expect that the execution time
will not differ a lot from each other.

~~~
chvid
Of all the benchmarks out there this is actually reasonable (though obviously
limited). It runs the same thing in different languages and the processing
time is long enough for VM startup not to matter.

Also, the performance shows the bytecode languages C# and Java doing fairly
well compared to C and Go.

~~~
imtringued
That has mostly to do with the fact that the code isn't manipulating objects
at all. As soon as you start pointer chasing in languages where almost
everything is an object, like JavaScript, performance will suffer.

------
lispm
SBCL on my 2012 Mac mini:

    
    
        rjmacmini:~$ sbcl
        This is SBCL 1.4.2, an implementation of ANSI Common Lisp.
        More information about SBCL is available at <http://www.sbcl.org/>.
    
        SBCL is free software, provided as is, with absolutely no warranty.
        It is mostly in the public domain; some portions are provided under
        BSD-style licenses.  See the CREDITS and COPYING files in the
        distribution for more information.
        * (defun fib (n)                        
            (declare (fixnum n)
                     (optimize (speed 3) (debug 0) (safety 0)))
            (if (<= n 1)
                1
              (the fixnum
                   (+ (fib (- n 1))
                      (fib (- n 2))))))
    
        FIB
        * (time (fib 46))
    
        Evaluation took:
          14.957 seconds of real time
          14.947616 seconds of total run time (14.934818 user, 0.012798 system)
          99.94% CPU
          38,799,794,836 processor cycles
          0 bytes consed
    
        2971215073
        * (SAVE-LISP-AND-DIE "/tmp/fiblisp" :toplevel (lambda (&rest args) (print (fib 46))) :executable t)
        ; in: SAVE-LISP-AND-DIE "/tmp/fiblisp"
        ;     (LAMBDA (&REST ARGS) (PRINT (FIB 46)))
        ; 
        ; caught STYLE-WARNING:
        ;   The variable ARGS is defined but never used.
        ; 
        ; compilation unit finished
        ;   caught 1 STYLE-WARNING condition
        [undoing binding stack and other enclosing state... done]
        [defragmenting immobile space... 643+15263+734+344+25029+16756 objects... done]
        [saving current Lisp image into /tmp/fiblisp:
        writing 0 bytes from the read-only space at 0x20000000
        writing 848 bytes from the static space at 0x20100000
        writing 1863680 bytes from the immobile space at 0x20300000
        writing 11472480 bytes from the immobile space at 0x21b00000
        writing 26542080 bytes from the dynamic space at 0x1000000000
        done]
        rjmacmini:~$ time /tmp/fiblisp
    
        2971215073 
        real	0m14.785s
        user	0m14.755s
        sys	0m0.020s
        rjmacmini:~$

~~~
stevelosh
I don't think adding debug/safety 0 are really worth it. In this:

    
    
        (declaim (optimize speed)
                 (ftype (function (fixnum) fixnum) fib))
    
        (defun fib (n)
          (if (<= n 1)
            1
            (+ (fib (- n 1))
               (fib (- n 2)))))
    
        (print (fib 46))
    

adding `(safety 0) (debug 0)` took the time from 13.17s (with just the `speed`
declaration) down to 12.94s for me. Is a 2% speed increase really worth the
danger of `(safety 0)`?

~~~
lispm
Generally depends on the implementation and the default behavior.

But anyway, it's a good point.

------
jonenst
They should

\- remove the memoized versions (obviously they're faster)

\- show the executed instructions for each language of the main loop (which
would be a nice exercise with all the vms and dynamic languages)

------
ainar-g
Re. Go.

I've turned this code into a Go benchmark and run it with both the default
cmd/compile (1.11) and gccgo (8.2.0).

    
    
      cmd/compile:
      BenchmarkFib-4   	       1	16530409619 ns/op
      PASS
      ok  	command-line-arguments	16.533s
      
      gccgo:
      BenchmarkFib-4   	2	 776368644 ns/op
      PASS
      ok  	command-line-arguments	2.381s
    

So yeah, if you are doing heavyweight math stuff in Go, you might consider
switching to gccgo. You might get up to a 20x performance boost.

Here is the gist, please tell me if I screwed up somewhere:
[https://gist.github.com/ainar-g/1bd363d41c441d9ebf05c0c0b9f2...](https://gist.github.com/ainar-g/1bd363d41c441d9ebf05c0c0b9f2a529).

EDIT: After some disassembly and experimenting, it seems like what we see here
is some clever unrolling and memoisation. If I change 46 from a constant to a
variable, it becomes twice as slow. Plus there is this in disasm:

    
    
      /tmp/go/fibbench_test.go:13
              return fib(n-1) + fib(n-2)
          30b9:       e8 72 ff ff ff          callq  3030 <command_line_arguments.fib>
          30be:       bf 25 00 00 00          mov    $0x25,%edi
          30c3:       e8 68 ff ff ff          callq  3030 <command_line_arguments.fib>
          30c8:       bf 24 00 00 00          mov    $0x24,%edi
          30cd:       e8 5e ff ff ff          callq  3030 <command_line_arguments.fib>
          30d2:       bf 23 00 00 00          mov    $0x23,%edi
      ...
    

What puzzles me is why gcc doesn't do this for the C version, even if I add an
explicit __attribute__((const)).

~~~
FabHK
So you're saying it would run in less than 1/4 of the time of the Nim, C, C++
implementations? Sure it's running the same algo?

~~~
ainar-g
I am as confused as you are. The C version is much slower on my PC. I've added
the link to the code above. Please tell me if you see something fishy.

------
joserr
SBCL on my AMD FX(tm)-8350 Eight-Core Processor:

    
    
       ~$ sbcl
       This is SBCL 1.4.11, an implementation of ANSI Common Lisp.
       More information about SBCL is available at <http://www.sbcl.org/>.
    
       SBCL is free software, provided as is, with absolutely no warranty.
       It is mostly in the public domain; some portions are provided under
       BSD-style licenses.  See the CREDITS and COPYING files in the
       distribution for more information.
       * (defun fibonacci-tail-recursive ( n &optional (a 1) (b 1))
               (declare (optimize (speed 3) (safety 0) (debug 0))
                        (type fixnum n a b))
               (if (< n 1)
                   a
                   (fibonacci-tail-recursive (- n 1) b (+ a b))))
    
       FIBONACCI-TAIL-RECURSIVE
       * (time (fibonacci-tail-recursive 46))
    
       Evaluation took:
         0.000 seconds of real time
         0.000001 seconds of total run time (0.000001 user, 0.000000 system)
         100.00% CPU
         2,513 processor cycles
         0 bytes consed
      
         2971215073
       *

~~~
codr4
Different machine and a superior algorithm; what are you aiming at with this?

~~~
joserr
It's true, I only read recursive Fibonacci. Now I see my mistake.

~~~
codr4
That being said, this one [0] gives correct results and runs slightly faster
:)

[0]
[https://gist.github.com/codr4life/59f2c02403b27d551e706f673b...](https://gist.github.com/codr4life/59f2c02403b27d551e706f673bc50f15)

------
ellisv
Several implementations are wrong -- they don't give the correct result.

~~~
hurrrrr
They all seem to be consistently wrong. The series is shifted by 1.
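
Concretely, the shift comes from the base cases: the repo's implementations
return 1 for both fib(0) and fib(1), so they compute the standard F(n+1). A
quick illustrative check:

```python
def repo_fib(n):
    # Base cases as used throughout the benchmark repo.
    return 1 if n <= 1 else repo_fib(n - 1) + repo_fib(n - 2)

def standard_fib(n):
    # Textbook definition: F(0) = 0, F(1) = 1.
    return n if n <= 1 else standard_fib(n - 1) + standard_fib(n - 2)

print([repo_fib(n) for n in range(7)])      # [1, 1, 2, 3, 5, 8, 13]
print([standard_fib(n) for n in range(7)])  # [0, 1, 1, 2, 3, 5, 8]
```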

------
celrod
Because Fortran was listed as x.xxx, I figured I'd try it. Taking the minimum
of 100 runs of gcc/gfortran, 200 runs of g++, and 4 runs of Julia 1.1-dev:

gcc: 3.4866

g++: 3.4428

gfortran: 3.4953

Julia: 8.0378

I was doing other things while the benchmarks ran, which added noise. For the
first run of g++, the mean was slightly higher than C's mean, while the
standard deviation was much higher than C or Fortran's. So I reran it for C++,
and then decided to just report the minimums for everything instead, because
the minimum is probably the least biased measure*.

*So long as there are no allocations that occasionally get cleaned up by a
garbage collector. If there were, the GC cost should get amortized over all
the runs that contributed to it. We don't have to worry about this here.

While the C and C++ assembly for the fib function was the same, Fortran's was
different.

Julia's assembly was extremely brief in comparison, because it doesn't have
any recursion optimizations -- it is just the check and then two calls to
itself.

~~~
ChrisRackauckas
Yes, Julia doesn't do tail call optimizations (TCO) which it could do with
LLVM. Just not enabled yet. I find it funny when people try to say that
Julia's website benchmarks are cherrypicked to look good when the first
example shows that it's tracking an optimization which it's missing...

------
bestboy
One can improve the Java result by allowing the JIT compiler to inline the
recursive calls.

    
    
       -XX:MaxRecursiveInlineLevel=1 (default): 6.667 s
       -XX:MaxRecursiveInlineLevel=2:           6.141 s
       -XX:MaxRecursiveInlineLevel=3:           5.768 s
       -XX:MaxRecursiveInlineLevel=4:           5.400 s
       -XX:MaxRecursiveInlineLevel=5:           5.361 s
       -XX:MaxRecursiveInlineLevel=6:           5.072 s
    

For comparison

    
    
       fib.c with -O3: 3.764 s
    

edit: formatting and adding more data points

------
iainmerrick
What’s the point of even mentioning the memoized versions? OF COURSE the naive
recursive version is slow and it’s very easy to write something much faster in
any language whatsoever.

------
xenadu02
    
    
      $ swiftc -Ounchecked -g fib.swift
      $ time ./fib
      2971215073
    
      real	0m6.341s
      user	0m6.297s
      sys	0m0.023s
    

Using unchecked mode is closer to what C is doing (skipping bounds and
overflow checks).

------
codr4
I would submit an entry for my own project, Snabl [0]; but it intentionally
doesn't support the algorithm since naive recursion beyond a few levels
doesn't make much sense in practice. I added a separate keyword to force tail
calls since that's usually what you want.

[0]
[https://github.com/codr4life/snabl#function](https://github.com/codr4life/snabl#function)

------
FabHK
A few observations:

\- on my machine, C++ is slower than C.

\- I thought that in Julia, startup and compile time could be a factor, but
they're pretty negligible (maybe 1 to 5% or so of the runtime, for Julia 0.7).

\- Without the -O3 switch, the runtime more or less doubles for C and C++,
making these languages slower than Swift, Go, Java, Dart, Julia, etc. That
surprised me.

~~~
olzd
The -O3 switch removes a recursive call, among other things
([https://godbolt.org/z/oS3Cju](https://godbolt.org/z/oS3Cju)).

~~~
FabHK
Interesting. (Awesome compiler explorer website, btw).

The Julia version (whose runtime is halfway between the C versions with vs
without -O3) contains both recursive calls, FWIW.

------
decentralised
I decided to try this challenge using Solidity and made a little writeup
comparing four implementations.

[https://medium.com/@jpa_of_snc/fibonacci-in-solidity-8477d90...](https://medium.com/@jpa_of_snc/fibonacci-in-solidity-8477d907e22a)

------
terancet
As for me, these benchmarks are strange because some solutions are not
optimized the same way -- the Node.js implementation uses the memoized
approach (which outperforms the recursion), while the Java solution (as an
example) implements the recursive approach.

~~~
thedufer
I think the intent is that the main implementations are those in files like
`[Ff]ib.*`. There are additional memoized implementations for some languages,
but presumably these are only used in the "Optimized code that breaks the
benchmark" section.

------
acangiano
Elixir can be made much faster via tail call optimization.

Something like:

    
    
      def fibonacci(n) when n >= 0, do: fib(0, 1, n)
    
      defp fib(a, b, n) do
        case n do
          0 ->
            a
    
          _ ->
            fib(b, a + b, n - 1)
        end
      end

~~~
brightball
There's a PR there to do just that. I don't follow the reasoning for why it
hasn't been accepted. It's also using Elixir script instead of precompiled
Elixir.

[https://github.com/drujensen/fib/pull/33](https://github.com/drujensen/fib/pull/33)

~~~
acangiano
Yeah, I don't think this breaks any of the rules. We are still using
recursion. Just leveraging what the language has to offer.

~~~
alangpierce
The algorithm being benchmarked is the naive double-recursive fibonacci
algorithm, which ends up performing over a billion addition operations in this
case. The tail-recursive function above is the iterative bottom-up fibonacci
algorithm that does it in about 50 addition operations. It's true that both
use recursion, but the second is definitely not the tail-recursive version of
the first.

It is possible to do a form of tail call optimization for the double-recursive
algorithm, described here:
[https://news.ycombinator.com/item?id=18096804](https://news.ycombinator.com/item?id=18096804)

This reduces the total number of function calls but not the number of
additions (since it's not changing the algorithm, just the function
invocation strategy when running it).
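
Those operation counts can be checked directly for small inputs (a sketch; the
naive algorithm's addition count equals fib(n) - 1 under the repo's
fib(0) == fib(1) == 1 convention, while the bottom-up version does one
addition per step):

```python
def naive_additions(n):
    """Additions performed by the naive double-recursive fib."""
    if n <= 1:
        return 0
    return 1 + naive_additions(n - 1) + naive_additions(n - 2)

def bottom_up_additions(n):
    """The tail-recursive accumulator version does one addition per step."""
    return n

print(naive_additions(30), "vs", bottom_up_additions(30))  # 1346268 vs 30
```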

------
edoo
Something isn't quite right. Nim 'compiles' into C code ready for compilation,
so it should not be faster than the raw C and C++ implementations. Frankly,
even the C and C++ implementations shouldn't be so different from each other.

------
sambe
I think Rust also does not have tail call optimisation. Nim and Crystal I'd
heard why relying on underlying compilers, and could sometimes fail to get it
(perhaps more than you'd expect using a compiler more "directly"?).

------
bjourne
Sure but which languages are still there to calculate fib(100) and which
bailed out?

------
jomoga
For Haskell try:

    
    
        -- Memoized variant is near instant even after 10000
        memoized_fib :: Int -> Integer
        memoized_fib = (map fib [0 ..] !!)
          where fib 0 = 0
                fib 1 = 1
                fib n = memoized_fib (n-2) + memoized_fib (n-1)

or

    
    
        fibM = \n -> values !! n
          where values = [fibAux m | m <- [0..]]
                fibAux n | n <= 1 = n
                         | otherwise = fibM (n-2) + fibM (n-1)

or

    
    
        fibM2 :: Int -> Integer
        fibM2 = \n -> values !! n
          where values = [fibAux m | m <- [0..]]
                fibAux 0 = 0
                fibAux 1 = 1
                fibAux n = fibM2 (n-2) + fibM2 (n-1)

Run the following at the ghci Haskell prompt:

    
    
        memoized_fib 47
        fibM 47
        fibM2 47

If you want to wait, try:

    
    
        -- Traditional implementation of fibonacci, hangs after about 30
        slow_fib :: Int -> Integer
        slow_fib 0 = 0
        slow_fib 1 = 1
        slow_fib n = slow_fib (n-2) + slow_fib (n-1)

------
serichsen
The sport here seems to be which compiler manages to rewrite the bad code
best.

Tail call optimization is only marginally relevant here, since the call in
tail position is the addition, not one of the recursive calls.
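One common way to put every call back into tail position is
continuation-passing style; this is a generic Python sketch of the
transformation (not necessarily the approach from the linked post, and CPython
doesn't actually eliminate tail calls, so the stack still grows):

```python
def fib_cps(n, k=lambda x: x):
    # every call below is in tail position; the pending "+" now lives
    # inside the continuation k instead of in a stack frame
    if n <= 1:
        return k(n)
    return fib_cps(n - 1, lambda a: fib_cps(n - 2, lambda b: k(a + b)))

print(fib_cps(10))  # 55
```

Without real tail call elimination this only relocates the bookkeeping, which
is the point being made above.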

------
midgetjones
I'd be interested to see how Elixir does with an OTP-optimised redo. Probably
still not great, but I feel like the example isn't playing to its strengths.

~~~
obahareth
The example was indeed not playing to its strengths, it wasn't making use of
tail call optimization.

I made a PR to fix that:
[https://github.com/drujensen/fib/pull/33](https://github.com/drujensen/fib/pull/33)

~~~
thedufer
That's obviously a better choice if you actually want Fibonacci numbers, but
it defeats the point of a benchmark if you implement a completely different
algorithm in one of the languages.

------
AndyKelley
Does Zig get a mention?
[https://godbolt.org/z/G0uvIU](https://godbolt.org/z/G0uvIU)

------
alangpierce
It would be great to see webassembly on here as well! (Maybe hand-written and
compiled from a few different languages.)

------
bcherny
Since we’re talking about recursion, it would be neat to add benchmarks for
languages that encourage recursion over iteration (not top 10, but could be
fun):

\- Haskell

\- OCaml

\- Scala

\- F#

\- Clojure

\- Common Lisp

~~~
rbjorklin
Haskell and OCaml have already been merged :)

------
BooneJS
A pretty fast way in Haskell/ghci[0] is:

λ: fibs = 1 : 1 : zipWith (+) fibs (tail fibs) λ: last $ take 47 fibs
2971215073

[0]:
[https://wiki.haskell.org/The_Fibonacci_sequence#Canonical_zi...](https://wiki.haskell.org/The_Fibonacci_sequence#Canonical_zipWith_implementation)

~~~
st1ck
It got mingled without newlines (and "λ:" prompt doesn't help readability). In
one line:

let fibs = 0 : 1 : zipWith (+) fibs (tail fibs) in fibs !! 47

Shorter but less readable:

fix (scanl (+) 0 . (1:)) !! 47

------
amelius
What happened to the "Great Programminglanguage Shootout"?

~~~
igouy
Google can be your friend :-)

[https://www.google.com/search?q=Great+Programminglanguage+Sh...](https://www.google.com/search?q=Great+Programminglanguage+Shootout)

~~~
amelius
Yeah, I meant why is the project so quiet lately, and why doesn't HN refer to
it more often.

~~~
igouy
Updates continue as usual:

[https://salsa.debian.org/benchmarksgame-
team/benchmarksgame/...](https://salsa.debian.org/benchmarksgame-
team/benchmarksgame/commits/master)

HN does refer to it:

[https://hn.algolia.com/?query=benchmarksgame&sort=byDate&pre...](https://hn.algolia.com/?query=benchmarksgame&sort=byDate&prefix&page=0&dateRange=pastYear&type=all)

As always, people crave novelty.

As always, it's easier to write simple 10-line programs than 100-line programs
for 10 different tasks.

------
khebbie
The right way to do fibonacci would probably be to add memoization...
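A minimal Python sketch of what that looks like, using the same base cases as
the C example elsewhere in this thread (`fib(0) = fib(1) = 1`):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # still the recursive definition, but each value is computed only once
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)

print(fib(46))  # effectively instant; the uncached version takes minutes
```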

~~~
toolslive
Why not calculate it in constant time ?

[https://artofproblemsolving.com/wiki/index.php?title=Binet%2...](https://artofproblemsolving.com/wiki/index.php?title=Binet%27s_Formula)

~~~
edflsafoiewq
You can't really compute it in constant time since the number of bits in the
nth Fibonacci number is O(n), so you need to take at least that long just to
write the result out.

Computing with Binet's formula is also rather tricky. You just need to round
φ^n/√5, but how many bits of √5 do you need to use?
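A quick Python experiment shows how soon double-precision Binet goes wrong
(illustrative; the exact failure point depends on the float implementation):

```python
import math

def fib_binet(n):
    # Binet's formula evaluated in 64-bit floats
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))

def fib_exact(n):
    # exact iterative reference
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# find the first n where the float version disagrees with the exact one
first_bad = next(n for n in range(200) if fib_binet(n) != fib_exact(n))
print(first_bad)  # doubles give up well before n = 100
```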

~~~
toolslive
True: adding two 64-bit numbers is constant time for me, but adding two
4096-bit numbers is not. Eventually, even the simplest operation becomes O(ln n).

------
stephc_int13
This is bullshit. IMHO, the recursive pattern is one of the worst constructs,
this is really bad engineering and should not be used for benchmarking...

~~~
codr4
So you're saying never use recursion? Or never benchmark it? I don't get it,
it's a tool; a pretty good tool when combined with tail call optimization.
Yes, the algorithm they use is naive; I believe that's part of the point.

------
firic
Why is Python 2.7 faster than 3? I thought one of the big reasons to switch to
3 was performance.

~~~
kec
Where did you hear that? 3 hasn’t even been competitive with 2 until very
recently.

The reasons for the switch have always been better Unicode handling and not
being left behind as the community marches on (and, soon, the lack of security
updates for 2).

------
KirinDave
I'm just going to put this out there: this benchmark is silly, and only serves
to make languages with specific calling styles look good. It doesn't predict
any real-world performance. It doesn't speak to real-world optimizer outcomes.
It really just measures how shallow your call abstraction is. Several
languages here perform much worse than equivalent code because of the way the
language interprets and executes naive recursion.

I've seen this game before, so the very first place I looked was the Haskell
version. Sure enough, it doesn't even try to force the call graph at all.
There's even a section entitled "breaks the benchmark", and it doesn't note
that Haskell and other languages with different call semantics are not doing
the same thing at all. It just says Haskell doesn't terminate.

Confusingly, there is then a "mem" benchmark which seems to ask, "What happens
if we give even the vaguest damn about algorithmic complexity?" But these are
so all over the map that they don't even come close to measuring the same
thing, and they have different memory usage.

What's frustrating about this is that the uncached double recursion is always
the wrong way to write this code. It's never good, and it really doesn't
benchmark anything real-world. It's not even a very good compiler benchmark,
because many optimizers actually go under the hood and rewrite code in this
obvious style to be something else entirely, just to do better on these
benchmarks.

For the Haskell benchmark as an example, the best way to write this lazily is
to write a function that consumes and drops a list, like so:

    
    
        fibF n = head $ drop n fibs
          where
            fibs = 0 : 1 : next fibs 
            next (first : rest) = (first + head rest) : next rest
    
        -- Or using a library function most Haskell Fp devs know.
        fibZip :: Int -> Int
        fibZip n = head $ drop n fibz
          where fibz = 0 : 1 : zipWith (+) fibz (tail fibz)
    
    

This is fast (it gets fused down to essentially a for loop in Haskell, and
others like JavaScript and Clojure can use this technique for a
memory-efficient approach), it's very straightforward, and it's also
_completely_ outside the world of something you can write naturally in C or
Java (there is a natural Golang expression, but I don't see people using it).

This approach isn't even particularly fair to other languages that are _good_
at producing performant machine code, like Rust.

These games are not really indicative of anything. They waste energy. Folks
should understand what their language runtimes and compilers are capable of,
rather than asking, "How well does this fare on the worst possible algorithm
for an operation whose semantics are only loosely defined?"

------
enginaar
isn't there supposed to be javascript results?

~~~
nhumrich
It's listed under "node"

~~~
enginaar
oh i see, thanks

~~~
1mike12
which is "broken" due to V8's optimizations unfortunately

------
arisAlexis
so node 18 and python 500? really?

------
oweiler
Lua would probably outperform most languages on the list.

~~~
shakna
I couldn't find a list of the hardware used in the benchmarks, so comparing is
difficult, though before testing, I'd lean towards agreeing with you. Luajit
is often on par with C, D or Go.

However, as C was one of the faster, I'll use it as a comparison.

fib.c, compiled with -O3: 10.49user 0.28system 0:11.62elapsed

    
    
        #include <stdio.h>
    
        long fib(long n) {
          if (n <= 1) return 1;
          return fib(n - 1) + fib(n - 2);
        }
    
        int main(void) {
          printf("%li\n", fib(46));
          return 0;
        }
    

Lua:

    
    
        function fib(n)
          if n <= 1 then
            return 1
          else
            return fib(n - 1) + fib(n - 2)
          end
        end
    
        print(fib(46))
    

Luajit: 48.66user 0.11system 0:52.95elapsed

Lua 5.3: 717.21user 8.40system 14:19.47elapsed

Luajit was much slower than C for this, which can be somewhat surprising.

Luajit would probably beat Ruby for its interpreted crown, but without
optimisation it won't beat the big boys.

I would say that Lua isn't a good fit for solving this kind of problem, with
these constraints, because every function call requires a hash lookup, which
is irritating.

Of course, you could use Luajit's FFI to use C's implementation, which would
be somewhat faster. Or expose the C implementation as a Lua library.

However, Lua is probably also a really good fit for memoization, and other
techniques like that.

    
    
        local nums = {}
        local fib
    
        fib = function(n)
          if n <= 1 then
            return 1
          else
            if nums[n] then
              return nums[n]
            else
              nums[n] = fib(n - 1) + fib(n - 2)
              return nums[n]
            end
          end
        end
    
        print(fib(46))
    

This is a fairly naive implementation, but has the same final result as the
previous examples... And 'time' is unable to measure how fast it is (Both
Luajit and Lua5.3). For all intents and purposes, it's instant.

~~~
drujensen
Hardware used is listed at the top of the readme.
[https://github.com/drujensen/fib/blob/master/README.md](https://github.com/drujensen/fib/blob/master/README.md)

------
octonion
I have a similar GitHub repo, but with a more complicated example:

[https://github.com/octonion/puzzles/tree/master/blackjack](https://github.com/octonion/puzzles/tree/master/blackjack)

------
super_mario
Using generators in Python you can do this much much faster :D.

    
    
        #!/usr/bin/env python
    
        """
        Calculate fibonacci numbers using a generator
        Usage: fibonacci.py [options] <N>
    
        Arguments:
            <N>             Print all Fibonacci numbers up to <N>-th
    
        Options:
            -h, --help      This help
    
        """
    
        from docopt import docopt
    
        def fibonacci():
            """generate fibonacci numbers"""
            a, b = 0, 1
            while 1:
                yield a
                a, b = b, a + b
    
        if __name__ == '__main__':
            try:
                args = docopt(__doc__)
                fib = fibonacci()
                for i in range(int(args["<N>"])):
                    print fib.next()
            except ValueError:
                print "You must specify valid integer"
            except KeyboardInterrupt:
                print "Good-bye"
    
    

You get something like

    
    
        $ time ./fibonacci.py 46
        0
        1
        1
        2
        3
        5
        8
        13
        21
        34
        55
        89
        144
        233
        377
        610
        987
        1597
        2584
        4181
        6765
        10946
        17711
        28657
        46368
        75025
        121393
        196418
        317811
        514229
        832040
        1346269
        2178309
        3524578
        5702887
        9227465
        14930352
        24157817
        39088169
        63245986
        102334155
        165580141
        267914296
        433494437
        701408733
        1134903170
    
        real	0m0.038s
        user	0m0.015s
        sys	0m0.018s

~~~
danbruc
_Recursive_ fibonacci benchmark - you are doing something different, in which
case you could just evaluate the closed form φⁿ / √5 and be even faster.
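If the goal is just a fast exact answer rather than a benchmark, the
fast-doubling identities give F(n) in O(log n) big-integer multiplications
with none of the closed form's floating-point rounding issues. A generic
Python sketch, unrelated to the repo:

```python
def fib_fast_doubling(n):
    """Return (F(n), F(n+1)) using the identities
    F(2k) = F(k) * (2*F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2."""
    if n == 0:
        return (0, 1)
    a, b = fib_fast_doubling(n // 2)  # (F(k), F(k+1)) for k = n // 2
    c = a * (2 * b - a)               # F(2k)
    d = a * a + b * b                 # F(2k+1)
    if n % 2 == 0:
        return (c, d)
    return (d, c + d)

print(fib_fast_doubling(100)[0])  # exact F(100) in ~log2(100) steps
```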

~~~
super_mario
I'm perfectly aware it's meant to be recursive, but it's also a completely
pointless test. It's not tail recursive, so you are mostly measuring function
call overhead in various languages. Various languages have options to perform
much faster anyway, but the solution is of course going to be language
specific.

