

Haskell as fast as C: working at a high altitude for low level performance - Menachem
http://donsbot.wordpress.com/2008/06/04/haskell-as-fast-as-c-working-at-a-high-altitude-for-low-level-performance/

======
bo1024
This is very cool, but it still makes me extremely skeptical. In brief, the
more posts of this type I see, the more I become convinced that writing
performant code in high-level languages requires a series of tricks that get
around the "natural" or "naive" way to write the code. The impression I get is
that high-level languages allow you to express your thoughts concisely, or
they allow you to mimic the performance of low-level languages, but not both
at the same time.

Notice that it takes an entire blog post to use Haskell to emulate the
performance of a straightforward 20-line C program that can be written in
under 60 seconds. There are high-performance tasks which are practically
trivial in an imperative language but which merit a conference paper when
accomplished in Haskell.

So it's hard to see something like this as an argument that higher-level
languages are simplifying anything.

~~~
kruhft
I've always wondered why language designers don't strive to optimize for the
trivial cases. There is a natural way to design code for a beginner, and
having that map to the most efficient constructs seems to be one of the areas
of programming language research that is lacking. Maybe it's the need for the
'sufficiently smart compiler' that is holding things back, but given that PLs
are abstractions, couldn't the underlying abstraction be completely different
from the user-visible abstractions?

~~~
Periodic
I believe they do optimize for the simple cases, but just in very different
ways. Haskell optimizes for composition, abstraction and expressiveness in a
function-application sense. C optimizes for imperative loops, simple functions
and controlling your memory layout and execution precisely [There is probably
a better characterization of C, please comment].

I think there are plenty of tricks involved in writing performant C code that
aren't obvious. Things like cache behavior, memory access patterns, etc. The
job of the compiler and PL is to help us by making it unnecessary to worry
about such things unless we really need to.

It's a testament to the power of modern programming languages and computer
speeds that there are many programmers who don't understand registers, caches,
assembly, virtual memory, etc.

------
duaneb
Having seen hundreds of these types of blog posts, touting faster-than-c
(superluminal?) benchmarks in arbitrary languages, I'm extremely skeptical.
Real-world applications are rarely solely limited by small chunks of code that
are simple enough to be optimized independently of the rest of the program.
Accurate comparisons should be done on large, complex code bases that mirror
an equivalent C program.

Unfortunately, those don't exist, so in my mind the true performance potential
of haskell is still unknown. I do have high hopes for the language, especially
since whole-program optimization and aggressive inlining/code folding should
yield very, very efficient code, but as of yet _the only large programs in
haskell remain GHC and darcs_, and darcs is extremely slow.

Still: a single benchmark showing a good result is better than one showing a
bad result.

~~~
joeyh
At 15 thousand lines, git-annex is only 5 thousand lines of haskell short of
darcs. I happen to know, since I wrote git-annex. I'm not sure where you're
coming from with your statement about there only being two large haskell
programs.

For that matter, I don't know if I'd consider darcs's 20 kloc very large. Or
that I'd consider another haskell program I wrote, github-backup, to be small
-- that 2 kloc program sits at the apex of a lot of libraries, and probably
combined they have more lines of code than darcs. Your whole premise about
lines of code feels thoroughly flawed to me.

Anyway, as a git extension, git-annex is expected to run quite fast. I've
never had any difficulty, in writing git-annex, with the speed of haskell
code. I'm sure darcs is slow due to its patch theory thing, not due to its
implementation language.

~~~
duaneb
> I'm not sure where you're coming from with your statement about there only
> being two large haskell programs.

I concede the point, but I'd rather say that there's only _one_ large haskell
program.

Let's consider large C/C++ programs:

1. Chromium clocks in around 4 million lines of code.

2. LLVM/clang together has a (fuzzy) estimate of ~1,165,539 LOC on my system.

3. GHC has unknown lines of code, and I am unsure how to count it, whether to
include the standard library, etc, but it has at least 100k, so we'll just
call that sufficiently large.

Now, I know haskell can be fast—I've spent a lot of time myself hand-tuning
ghc processed code. But I've noticed that unless I can trace the bottleneck to
a single function or small set of functions, optimization to near-c levels is
extremely difficult. Now, even 2-3x slower than c—far slower than ghc-
generated code—is still very fast, so I'm not calling haskell slow. Far from
it.

~~~
joeyh
Actually, there are packages in hackage with > 100 kloc: CHXHtml,
KiCS-debugger, HaRe.

------
comex
I was going to go into a series of optimizations that could make the C code
faster, starting by comparing against the integer counter rather than the
double, then switching to vector math if applicable...

Then I realized that all this program does is compute an approximation of `(d
+ 1) / 2`. If you need the exact same result as it computes
(500000000.067109), it's hard to add any parallelism because it depends on the
imprecision of the intermediate add results, but if you can settle for the
mathematically correct solution (500000000.500000), it's hard to distinguish
between "fair" optimizations and "unfair" ones - indeed, if you go ahead and
just change all the doubles to long longs (they can fit), the compiler will
automatically use the formula.

In general, I think that to get maximum performance on many problems like
this one, you _must_ make assumptions the compiler cannot, which means your
code _cannot_ be naive; but the difference between naive and non-naive code
in C is much less than in Haskell.

------
neutronicus
A year or so ago I tried for a couple days to write a 1-D neutron transport
solver based on this post, and found the experience frustrating in the
extreme. Obviously, I'm no Haskell expert, but I was really turned off of
using Haskell for numerical work.

I'd really like to give it a go again, but I feel like I'm missing a lot of
the knowledge required to solve P-(I)DEs in any kind of moderately clever way
in Haskell. If anyone can point me towards a nice resource for Haskell
numerics I'd be grateful (repa is not flexible enough for my needs).

~~~
jberryman
A lot of really clever folks monitor the Haskell tag on stackoverflow,
including dons. I would post there.

------
berkut
Getting languages to run as fast as (or faster than) C for things like tight
floating point or int calculations isn't _that_ difficult - at the end of the
day, any decent compiler, JIT or ahead-of-time, is going to produce roughly
the same asm.

What's difficult is getting the whole application as fast, as opposed to just
a few functions. The hot points of an application are rarely due to CPU
throughput or lack of asm instruction optimisation - they're normally due to
memory allocation inefficiencies or cache thrashing, or bad thread
concurrency, and _this_ is where using C/C++/(ADA in embedded world) shine, as
you have complete control over pretty much everything, from struct bit
packing, allocation size, allocation location (stack/heap), when to deallocate
memory (if ever in real-time's case), optimising for memory access patterns,
etc, etc.

~~~
stcredzero
_The hot points of an application are rarely due to CPU throughput or lack of
asm instruction optimisation - they're normally due to memory allocation
inefficiencies or cache thrashing, or bad thread concurrency, and this is
where using C/C++/(ADA in embedded world) shine, as you have complete control
over pretty much everything_

What you are saying is that C/C++/ADA are wonderful because compilers optimize
one thing (CPU throughput/instruction optimisation) whereas other factors are
more important (memory allocation inefficiencies or cache thrashing, or bad
thread concurrency).

Whenever a programmer says they need manual control of [X] -- it's time to
start looking at automation of [X]. (It may not work in all contexts,
however.)

~~~
berkut
There's a reason languages with garbage collectors aren't used where speed is
important - because they normally get in the way.

~~~
stcredzero
I wasn't thinking of GC. GC's been around a while, and is quite advanced, yet
is not desired for certain purposes.

Yet, there must be something that programmers conceptualize when they allocate
a struct, do whatever they do with it, then free it. Everything that I know
about how programmers' minds work tells me that, most likely, 80% of this work
is fairly mechanical.

Then again, you might want to look at iGC. (Granted, phones are pretty
ridiculously powerful in comparison to lots of embedded devices.) By tailoring
their GC to the particular way it's used, they can do interesting
optimizations. (Use comparisons to addresses to drastically reduce the number
of roots for tracing.)

------
jberryman
(2008)

...not that things have gotten slower since then.

~~~
z92
Thank you. This [2008] tag indicates I probably read it when it came out.
Therefore didn't click on it now. And you probably have saved some of my time.

Just wanted to inform you and others how much these little bits of information
help us.

------
wingo
> The fix is straightforward: just use a strict pair type for nested
> accumulators:

Uf. Haskell impresses me a lot, but it seems that performance-wise, it would
be better if it were strict by default.

~~~
Tyr42
But the rest of the fusion and other transformations work better when lazy.

------
it
I tried to reproduce the results, but the Haskell bit failed to compile on my
Mac with the -fvia-C flag. Without the -fvia-C flag, the Haskell version takes
about 11 times as long to run as the C version. A simple Go version runs at
the same speed as the C version.

I posted some code to run the comparison at <https://github.com/ijt/fast-as-c-
article>. Here are the results:

    
    
        [ issactrotts ~/haskell/fast-as-c-article ] ./compare
        == cmean ==
        gcc -O2 -o cmean cmean.c
        mean: 500000000.500000
    
        real    0m0.774s
        user    0m0.770s
        sys 0m0.002s
        == gomean ==
        6g gomean.go
        6l -o gomean gomean.6
        mean: 500000000.500000
    
        real    0m0.776s
        user    0m0.769s
        sys 0m0.003s
        == hsmean ==
        ghc -O2 hsmean.hs -optc-O2 -fvia-C --make
        [1 of 1] Compiling Main             ( hsmean.hs, hsmean.o )
        In file included from /usr/local/Cellar/ghc/7.0.4/lib/ghc-7.0.4/include/Stg.h:230,
    
                         from /var/folders/g2/ylbqfw5533n65z6qkxljg0_h0000gn/T/ghc23078_0/ghc23078_0.hc:3:0:
         
    
        /usr/local/Cellar/ghc/7.0.4/lib/ghc-7.0.4/include/stg/Regs.h:177:0:
             sorry, unimplemented: LLVM cannot handle register variable ‘R1’, report a bug
        make: *** [hsmean] Error 1
        == hsmean_nollvm ==
        ghc -O2 hsmean.hs -optc-O2 --make -o hsmean_nollvm
        [1 of 1] Compiling Main             ( hsmean.hs, hsmean.o )
        Linking hsmean_nollvm ...
        ld: warning: could not create compact unwind for .LFB3: non-standard register 5 being saved in prolog
        mean: 500000000.067109
    
        real    0m8.886s
        user    0m8.863s
        sys 0m0.018s

~~~
dons
These days you'd use the -fllvm flag

------
eta_carinae
Please don't make any claims about high performance based on micro-benchmarks.

------
jrockway
I started reading this and wondered, "why is dons back to using via-C instead
of LLVM". Then I realized that this article is 4 years old.

------
forgottenpaswrd
If you are new to programming languages, please don't listen to those
nonsense "X language faster than C" articles that come from people who are
religious about high level languages.

If you have only a hammer, everything is a nail.

It is a very dangerous meme that will make you incompetent in the real world,
where real life high level languages are >100 times slower than programs
written in C (by people who know what the computer is doing, as they knew
assembler first).

Microsoft geniuses fell for this meme and as a result created Windows Vista,
where a 50KB file transfer could take you 20 minutes.

In Android, the garbage collector will start collecting memory whenever it
wishes and will visually break the continuity of the screen, making it
unresponsive at times. This is unacceptable for Apple, which used C for this
reason (yes, C, not Obj-C). Samsung and HTC started using C too for this.

C has its place, high level languages have their place. You trade abstraction
for control and (if you know what you are doing) performance.

Go and learn low level and high level and decide for yourself which one is
appropriate for which circumstance.

E.g. the fast Python in Python, things like Numeric Python, is written in C
for this reason (once people discovered how slow high level programming was
in real life).

If you are going to spend the same time optimizing C as you would spend
optimizing language X, you can make C super fast as well: not 10% faster;
with C you should get 100%, 1000% or 10000%.

Sorry, I feel super dumb having to say the obvious, but good programmers are
busy coding and the void is filled with nonsense.

~~~
derleth
You can say all the same things if you replace 'C' with 'assembly' and 'high-
level languages' with 'C'.

~~~
forgottenpaswrd
C is "portable assembler".

If I do something like

if (foo = bar) { }

I know that the computer is comparing one value to another (which means an op,
e.g. subtraction), then doing a jump.

I don't need to write the assembler, but I can estimate how long everything
takes really well (to within orders of magnitude). This is invaluable when
coding.

If you do [object foo] in Objective-C

or object.foo()

you lose all your control: sometimes classes will use iterative methods or
alloc and free memory every time you call a method, from nested methods,
making it super slow.

Sometimes I want abstraction, sometimes I want to know what is happening.

~~~
Xurinos
This just isn't true. C is not portable assembler. It was never intended to
be. I hear it claimed, and it is wrong every time somebody calls C a low-level
language close to assembler. You can make some roughly reasonable assumptions
about what comes out of the compiler, but often it is not what you think it
is.

Let's challenge this specific claim, that when you do "if (foo == bar)" -- I
corrected the syntax error, which is a symptom of C's high-level syntax and
not of the underlying assembly code -- you compare one value to another and
then jump. For this challenge, I will write some trivial code that we should
be able to make easy assumptions about, and I will compile it with debugging
enabled so that I can dump the results with gdb.

    
    
      $ gcc -g example.c
    
      1       #include <stdio.h>
      2
      3       int main() {
      4          int foo = 10;
      5          int bar = 20;
      6          if (foo == bar) {
      7             printf("Fun\n");
      8          }
      9          return 0;
      10      }
    
      Dump of assembler code for function main:
      0x0000000100000ef8 <main+0>:    push   rbp
      0x0000000100000ef9 <main+1>:    mov    rbp,rsp
      0x0000000100000efc <main+4>:    sub    rsp,0x10
      0x0000000100000f00 <main+8>:    mov    DWORD PTR [rbp-0x4],0xa
      0x0000000100000f07 <main+15>:   mov    DWORD PTR [rbp-0x8],0x14
      0x0000000100000f0e <main+22>:   mov    eax,DWORD PTR [rbp-0x4]
      0x0000000100000f11 <main+25>:   cmp    eax,DWORD PTR [rbp-0x8]
      0x0000000100000f14 <main+28>:   jne    0x100000f22 <main+42>
      0x0000000100000f16 <main+30>:   lea    rdi,[rip+0x19]        # 0x100000f36
      0x0000000100000f1d <main+37>:   call   0x100000f30 <dyld_stub_puts>
      0x0000000100000f22 <main+42>:   mov    eax,0x0
      0x0000000100000f27 <main+47>:   leave  
      0x0000000100000f28 <main+48>:   ret    
    
    

We see that in the very basic version of this code with absolutely no
optimizations and doing the silliest things that we can, we store our two
values into some memory locations, perform a comparison (cmp), and jump if not
equal. We can see that the jump leads us to the puts() call.

Now, let's get smarter. The variables foo and bar do not change value, and we
only work with two variables in the routine. Therefore, we could optimize by
storing those values in temporary registers instead of using expensive memory
transfers. Further,since our two constants are being compared and will always
return a false, we actually have a section of code -- the printf -- that is
dead code, that can be completely removed from final compilation. Well, that's
simple, and everyone who uses C in production at least turns on some minor
optimization:

    
    
      $ gcc -g -O1 example.c  # the only difference is the -O1
    
      Dump of assembler code for function main:
      0x0000000100000f34 <main+0>:    push   rbp
      0x0000000100000f35 <main+1>:    mov    rbp,rsp
      0x0000000100000f38 <main+4>:    mov    eax,0x0
      0x0000000100000f3d <main+9>:    leave  
      0x0000000100000f3e <main+10>:   ret    
    
    

This does not look like our C code at all! And thankfully so! What a waste of
space and CPU time it would have been had we treated C like an interpreted
language! C is a high-level language with numerous compiler implementations
that can intelligently convert the human-readable code into the binary code
that represents the real situation behind the code.

The point here is that you are not properly guessing the assembler code that
will be produced. The compiler is doing a better job of that; that is the
compiler's job. As a programmer, you can just focus on the algorithm. C is not
an assembler macro language. For that, you would use things like "gas".

~~~
kevinnk
C is not assembly and hasn't been for a very long time. But I think when
people use the phrase "portable assembler" they really mean that in C you
both control the memory layout of data types very finely and that code maps
very directly to an equivalent assembly construct. True, optimizers
frequently change the actual executed code from what we expect, but C gives a
very intuitive feel of what the "upper bound" assembly output is.

For example in C "array[0] = (x + y);" will never be _more_ than a couple
assembly instructions long. In many languages, including Haskell (and in the
case of operator overloading, C++), the equivalent construct might map to
hundreds if not thousands of instructions. Or it might map to the same one or
two that C would emit. It's impossible to know and there is no reasonable
upper bound on what could happen.

~~~
stcredzero
_might map to hundreds if not thousands of instructions. Or it might map to
the same one or two that C would emit. It's impossible to know and there is no
reasonable upper bound on what could happen._

Over every possible piece of code that could be compiled anywhere, this might
well be true. But for a properly informed programmer for a given piece of
code, not so much.

~~~
kevinnk
>But for a properly informed programmer for a given piece of code, not so
much.

There are a couple reasons that even for "informed" programmers this is still
important:

1) For most dynamic languages, even simple operations can take a highly
variable amount of time to execute. How many instructions does an array access
take in Javascript? The answer depends on everything from the state of the JIT
to the types involved, both of which are usually impossible to know
beforehand. In C we can answer this pretty easily.*

2) The modern trend is towards writing more and more generic code. Even for
statically compiled languages like C++ and Haskell, the actual underlying
operations are _purposely_ abstracted away from you. Unless you know every
possible instance that your code could be used in, it is impossible to know
how long _any_ operation will take.

And all this is assuming that the programmer knows everything about their
compiler, assembler, standard library, imported libraries, etc, which isn't
true for all but the most expert programmers.

* Admittedly, the actual length of time it takes is dependent on the state of the processor, which can be very difficult to predict, but we will have a lot more information than we would have had otherwise.

~~~
stcredzero
You need to take both the "informed" and "given." Not all pieces of code are
"cross platform" and even within that, there's different levels.

In other words, you're talking about one end of the spectrum. You are right,
though, that things are moving in that direction.

------
stcredzero
It seems to me that machine learning has advanced to the point where those
"obligatory" points made again and again in things like language efficiency
threads could actually be automated.

What if we used a Bayesian filter to detect threads where such "obligatory"
points on both sides of a heated debate are made again and again, then used
machine learning to post a well-written and well-curated set of "obligatory"
comments? This would save a lot of man-hours online.

