
Haskell Beats C Using Generalized Stream Fusion [pdf] - profquail
http://research.microsoft.com/en-us/um/people/simonpj/papers/ndp/haskell-beats-C.pdf
======
kevingadd
Before you say 'oh, Haskell is still only half the speed of C++', you should
keep in mind _just_ how insane the C++ code they had to write looks. The Eigen
code they had to write is pretty nasty, and getting it correct required some
fairly specialized understanding.

In comparison, the performance they're getting out of Haskell is coming from
comparatively mundane, easy to understand code. That's pretty impressive.

Or another way of putting it: It looks like naive Haskell outperforms naive
C++, and naive Haskell is not that much slower than carefully-tuned C++.

    
    
        template <typename Derived>
        typename Derived::Scalar norm2(const MatrixBase<Derived>& v) {
            return v.dot(v);
        }
    

Why is it a template function? Why does it return Derived::Scalar instead of
double? Dear god, Eigen is magic.

~~~
pbsd
While 'Derived::Scalar' could very well be replaced by 'double' in that case,
you need the templated function to catch the temporary type (expression
template) that represents the u-v computation. This lets Eigen compute
u[i]-v[i] and the dot product inside the same inner loop.

Without the templated function, the compiler would first generate the
subtraction into a temporary vector, and then compute the dot product on that
vector.

~~~
kevingadd
That makes sense, thanks! I figured the template was being used to magically
substitute some sort of meta-type or container for temporaries.

------
jules
Although the results are impressive, the title is a bit optimistic. Haskell
beat GCC slightly in the first benchmark. Haskell beat C in the second
benchmark because the C code is very poorly written. They did not compare with
the obvious C implementation. Secondly, GCC isn't the best C compiler, even
though the paper claims that it is the best compiler that "we could find".
They should have used ICC. C++ is also 4x faster than Haskell in the second
benchmark, giving further indication that the C is slow just because it is
poorly written. So it's hard to justify such a title. The work itself is very
good however.

~~~
floody-berry
The first C example is pretty disingenuous, as they explicitly added loop
unrolling and prefetching tuned to x86-64 to the Vector library, then compared
against the basic C versions with "-funroll-loops" (and no prefetching, based
on the icc/gcc/clang compiler output I looked at). For some reason they also
use "-msse4.2" on a CPU supporting AVX, while the Haskell version is
generating AVX instructions.

The paper would have been better off with more complex examples (e.g.
functions that can't be trivially implemented in asm by hand with no register
pressure) and fewer comparisons to implementations that could be doing what
they are, but aren't.

~~~
mainland
First author here. The dot product example was compiled with GCC 4.7.2 -O3
-msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption to
Figure 5. What compiler options would you have suggested? The Haskell version
only used SSE instructions, _not_ AVX; this should have been made clear in the
paper.

The more complex examples are in Section 5.2; see Figure 8. Granted, we would
have liked to have done more, but deadlines are deadlines...

~~~
pbsd
Figure 7 shows Haskell-generated AVX instructions, albeit only using the lower
128 bits. That code would not run on an SSE4.2-capable Nehalem, for instance.

There are some other CPU-related slight inaccuracies in the paper. Prefetching
is repeatedly mentioned, even though its effect is negligible when one has a
perfectly linear memory access pattern; unaligned loads are mentioned as a
performance hit, but they are essentially free on the test processor (2600k,
Sandy Bridge).

Matrix multiplication would perhaps be a better example to show the power of
clever prefetching.

------
Xcelerate
I am curious why, every time a "language X beats language Y" test is posted on
here, there is an inevitable slew of comments along the lines of "the losing
language's code was poorly written".

Is the probability really so high as to be 100% that all optimization tests
were against a poorly written competitor? That's amazing if that's the case.

~~~
jules
I suspect this comment is at least partially directed at me. Did you read the
paper? Excluding some O(1) operations, the goal of the second benchmark is
this:

    
    
        double s = 0;
        for(int i=0; i<n; i++) s += pow(a[i]-b[i],2);
        return s;
    

But they did not use this C code. The C code they used made 3 calls to the
BLAS library. The end result is that the C code is doing something equivalent
to this:

    
    
        double x=0, y=0, z=0;
        for(int i=0; i<n; i++) x += a[i]*a[i];
        for(int i=0; i<n; i++) y += a[i]*b[i];
        for(int i=0; i<n; i++) z += b[i]*b[i];
        return x - 2*y + z;
    

While this does return the same result (assuming infinite precision
arithmetic) it is obviously not the way anybody would do it, since it's doing
3 times the work. Even worse, because their code is calling into the BLAS
library for each loop, the compiler is explicitly prevented from optimizing
the three loops and memory accesses by combining them into one. Note that the
Haskell code is doing the efficient single loop. So yes, it _is_ poorly
written code.

~~~
ky3
There are 2 pieces of C benchmarked. The other one computes `a[i]-b[i]` into a
temporary array and then makes a single call to BLAS. The speed is roughly
half that of Haskell, which dispenses with the space cost of the temporary
array.

~~~
jules
Yes, you're right, the other C code is even less efficient.

~~~
ky3
Good catch! It's the 3x BLAS that's 1/2 of Haskell's speed.

The 1x BLAS that uses 2x RAM gets shafted, presumably by cache lossage.

------
taeric
"The C++ programmer must worry about the performance implications of
abstraction."

This quote seems a touch odd when you consider just how much faster the C++
solution is. The Haskell programmer definitely does not get a complete pass on
worrying about the implications of abstraction. Indeed, this seems to
underscore that if performance is a major criterion, then language choice
still matters.

~~~
GhotiFish
I agree, auditing Haskell's performance characteristics is not easy. Then
again, nothing in Haskell is easy.

~~~
ufo
While it's true that lazy evaluation makes it harder to reason about
performance (especially memory use), the flip side is that it makes it easier
to reason about programs at a higher level, and you have more freedom when
defining abstractions.

For example, in Haskell you can freely rewrite common subexpressions with a
let, whenever you want and in whatever order you want.

    
    
        --original
        (f x) ... (f x)
    
        --let
        let y = f x in y ... y
    

On the other hand, if your language has side effects, these sorts of
transformations are not always safe.

~~~
plinkplonk
Having side effects and having lazy evaluation are distinct features, though.
One can conceive of a strictly evaluated Haskell-like language which isolates
side effects in monads and is still largely composed of pure functions.

~~~
tome
One can conceive of it, and yet it's never been done. Curious.

~~~
colanderman
Yes it has. Mercury is a perfect example of a purely functional language which
is not lazy. I/O is made functional via a uniqueness typing system.

~~~
tome
Right, so it uses uniqueness typing and not monads presumably.

------
CJefferson
While it is interesting to see generalised stream fusion, the conclusion is
that Haskell is still half the speed of C++.

Also, looking at the results most of the performance comes from which library
you are using, and how well it is optimised. No compiler is good enough to do
what the good libraries do.

~~~
gngeal
"No compiler is good enough to do what the good libraries do."

That's nonsense. And if you take a look, say, at NumPy, which uses heavily
tuned BLAS/LAPACK code, and Numexpr, which uses the "compile the snippet and
run it" approach, you'll see why. Libraries won't save your performance if you
can't optimize across procedure and module boundaries.

~~~
ori_b
Which is why most state-of-the-art C and C++ compilers support link-time
optimization.

~~~
gngeal
That won't help you much if you, say, need the fusion of matrix operations I
mentioned.

------
andrewcooke
i followed (and enjoyed) most of that, but i don't get (perhaps because i've
not used haskell in a _long_ time) how the consumer is automatically(?)
chosen.

from what i understand of section 3.2, the consumer is critical in selecting
the correct stream type from the bundle. who decides that? does the user have
to select the appropriate function? or does it somehow fall out naturally
from the use case? or does the compiler select it?

~~~
mainland
The library writer is the "consumer" here. The programmer just uses the
library, and the library chooses the proper stream representation.

Of course the programmer can also use the lower-level stream interface
directly if desired, but then the programmer must also know which stream
representation to choose.

------
33a
If Haskell beats C on that loop, then that is a compiler bug. Did they try
repeating these experiments using, say, ICC?

~~~
kevingadd
Personally, I would say that leaping straight to 'compiler bug' is a little
unwarranted. If you really think it's a compiler bug, you should compile the
code (they provided it) using the settings they used (I think they provided
those too; you can see them in a comment in this thread, actually), look at
the output assembly, find the bug, and file a bug report against the
compiler.

