Haskell Beats C Using Generalized Stream Fusion [pdf]

kg · on April 7, 2013

Before you say 'oh, Haskell is still only half the speed of C++', you should keep in mind just how insane the C++ code they had to write looks. The Eigen code they had to write is pretty nasty and getting it correct required some particular understanding.

In comparison, the performance they're getting out of Haskell is coming from comparatively mundane, easy to understand code. That's pretty impressive.

Or another way of putting it: It looks like naive Haskell outperforms naive C++, and naive Haskell is not that much slower than carefully-tuned C++.

    template < typename Derived >
    typename Derived::Scalar norm2( const MatrixBase <Derived >& v ) {
        return v.dot(v);
    }

Why is it a template function? Why does it return Derived::Scalar instead of double? Dear god, Eigen is magic.

pbsd · on April 7, 2013

While 'Derived::Scalar' could very well be replaced by 'double' in that case, you need the templated function to catch the temporary type (expression template) that represents the u-v computation. This lets Eigen compute u[i]-v[i] and the dot product inside the same inner loop.

Without the templated function, the compiler would first generate the subtraction into a temporary vector, and then compute the dot product on that vector.

kg · on April 7, 2013

That makes sense, thanks! I figured the template was being used to magically substitute some sort of meta-type or container for temporaries.

dfbrown · on April 7, 2013

I'm not sure why they wrote this in the first place. Eigen's MatrixBase already has a squaredNorm function: http://eigen.tuxfamily.org/dox/classEigen_1_1MatrixBase.html...

edit: I see, they address this in the article: In this speciﬁc case the Eigen library has the squared L2-norm in its API, but in more complicated examples there might be no such built-in provision

marshray · on April 7, 2013

That code doesn't look particularly insane to me.

Uh oh.

ibotty · on April 7, 2013

if i understand the paper correctly (and apply the knowledge to eigen), it't because of the same reason: you want to allow different stream fusion strategies to work.

dchichkov · on April 8, 2013

The code (actual instructions) that beats C++ in their example is being generated by their experimental "stream fusion" framework. And NOT by a generic Haskell compiler.

And now the question, do you really want to an unproven and unstable Haskell framework? While there are stable alternatives (lake Eigen, Theano, etc) available?

So no. Haskell (production ready Haskell) doesn't even come close.

jules · on April 7, 2013

Although the results are impressive, the title is a bit optimistic. Haskell beat GCC slightly in the first benchmark. Haskell beat C in the second benchmark because the C code is very poorly written. They did not compare with the obvious C implementation. Secondly, GCC isn't the best C compiler, even though the paper claims that it is the best compiler that "we could find". They should have used ICC. C++ is also 4x faster than Haskell in the second benchmark, giving further indication that the C is slow just because it is poorly written. So it's hard to justify such a title. The work itself is very good however.

mpyne · on April 7, 2013

> C++ is also 4x faster than Haskell in the second benchmark, giving further indication that the C is slow just because it is poorly written.

That would imply that the C version of the C++ code would be just as fast, if only it were not "poorly written".

However there are some types of code (including this kind) where having C++ available to generate the code for you makes it almost inherently faster than C.

You could certainly reimplement specific matrix operations for very specific types in C, but the C++ Eigen library does this all for you at compile time, and does it the best way for each combination of types and operations.

If you do this in C, even with macros everywhere, it's going to be much more difficult, especially since you can't (AFAIK) declare and discard temporary types within macros, though C11 at least allows you to refer generically to types within macros. The canonical example for this kind of disparity is C's qsort() vs. C++ templated sorts.

This is kind of the point to the paper. Performance of this sort is only possible in C++ if you're careful to follow the instructions provided with Eigen about using templated functions, and some very smart devs had to write Eigen in the first place. The Haskell code comes fairly close starting from the source code you'd write naïvely, and without tons of optimization on the stream fusion code generator.

jules · on April 7, 2013

In general, you're absolutely right. For this particular benchmark, this is not the case. See here: https://news.ycombinator.com/item?id=5508948

floody-berry · on April 7, 2013

The first C example is pretty disingenuous as they explicitly added loop unrolling and prefetching tuned to x86-64 to the Vector library, then compared the basic C versions with with "-funroll-loops" (and no prefetching based on the icc/gcc/clang compiler output I looked at). For some reason they also use "-msse4.2" on a CPU supporting AVX while the Haskell version is generating AVX instructions.

The paper would have been better off with more complex examples (e.g. functions that can't be trivially implemented in asm by hand with no register pressure) and fewer comparisons to implementations that could be doing what they are, but aren't.

mainland · on April 7, 2013

First author here. The dot product example was compiled with GCC 4.7.2 -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption to Figure 5. What compiler options would you have suggested? The Haskell version only used SSE instructions, not AVX; this should have been made clear in the paper.

The more complex examples are in Section 5.2; see Figure 8. Granted, we would have liked to have done more, but deadlines are deadlines...

pbsd · on April 7, 2013

Figure 7 shows Haskell-generated AVX instructions, albeit only using the lower 128 bits. That code would not run on an SSE4.2-capable Nehalem, for instance.

There are some other CPU-related slight inaccuracies in the paper. Prefetching is repeatedly mentioned, even though its effect is negligible when one has a perfectly linear memory access pattern; unaligned loads are mentioned as a performance hit, but they are essentially free on the test processor (2600k, Sandy Bridge).

Matrix multiplication would perhaps be a better example to show the power of clever prefetching.

floody-berry · on April 7, 2013

The Haskell assembly is using the 3-op AVX instructions, but not 256bit registers. The example is simple enough that it may not make a difference (2-op vs 3-op instructions), but in a more complex function with register spills it could have. ("-mavx" should be enough).

all that said, it is still very impressive work!

jules · on April 7, 2013

I would be surprised if prefetching helped at all, given that the memory access pattern is completely linear. But yes, it's always a problem that the work in a paper is tuned to the particular machine the benchmark is run on, while the things that are compared against are not. Also agreed that it's suspicious that Haskell is using AVX while the C isn't. However compared to the second benchmark, the first one is very fair ;-)

Perhaps somebody with an AVX capable processor can redo the two benchmarks with correct compiler options (i.e. -mavx) and/or with ICC, and replacing the C code of the second benchmark with code that is equivalent to the Haskell code:

    double s=0;
    for(int i=0; i<n; i++) s += pow(a[i]-b[i],2);

floody-berry · on April 8, 2013

I made a quick test for the C versions to try out flags and such: https://gist.github.com/floodyberry/5335542

I found out gcc supports -fprefetch-loop-arrays, although it is not guaranteed to have a positive effect. In this case, it does appear to run faster than without prefetching. AVX versions also run faster than the standard SSE counterparts even with 128 bit registers. ymm regs are faster than xmm.

gcc is faster than icc actually, and icc runs faster with the plain c version than the vectorized versions. I can't figure out how to make icc prefetch either, the switch that controls it doesn't seem to have an effect.

jules · on April 8, 2013

Nice! What were the results?

kostya-kow · on April 7, 2013

GHC and GCC are 2 of the most widely used compilers for the respective languages. The research uses most popular compilers, which are used by many different projects, not the most micro-optimized once.

jules · on April 8, 2013

The paper claims otherwise.

army · on April 8, 2013

That sums it up pretty well: great work, but take the evaluation with a grain of salt.

Xcelerate · on April 7, 2013

I am curious why every time there is a "language" beats "other language" in some test posted on here, there is an inevitable slew of comments along the lines of "losing language was poorly written code".

Is the probability really so high as to be 100% that all optimization tests were against a poorly written competitor? That's amazing if that's the case.

moomin · on April 7, 2013

It seems to me that most X beats Y posts are written by people with a great deal of knowledge of X, but without a matching knowledge of Y. This, unfortunately, leaves every post wide open to this sort of attack.

taeric · on April 7, 2013

I would wager a large contributor to this is that the idioms of some of these specialized fields do not generalize to other use cases.

So, while some researchers have gotten it such that a relatively straight forward implementation in a new language with new tools is better than the old way, the folks working the old way have not exactly stagnated.

This seems especially likely when you consider that for many of the really fast implementations, it is not that they are using no abstractions. The abstractions they are using are very close to the machine, because the machine is the abstraction they are using. (That make sense?)

jules · on April 7, 2013

I suspect this comment is at least partially directed at me. Did you read the paper? Excluding some O(1) operations, the goal of the second benchmark is this:

    double s = 0;
    for(int i=0; i<n; i++) s += pow(a[i]-b[i],2);
    return s;

But they did not use this C code. The C code they used made 3 calls to the BLAS library. The end result is that the C code is doing something equivalent to this:

    double x=0, y=0, z=0;
    for(int i=0; i<n; i++) x += a[i]*a[i];
    for(int i=0; i<n; i++) y += a[i]*b[i];
    for(int i=0; i<n; i++) z += b[i]*b[i];
    return x - 2*y + z;

While this does return the same result (assuming infinite precision arithmetic) it is obviously not the way anybody would do it, since it's doing 3 times the work. Even worse, because their code is calling into the BLAS library for each loop, the compiler is explicitly prevented from optimizing the three loops and memory accesses by combining them into one. Note that the Haskell code is doing the efficient single loop. So yes, it is poorly written code.

ky3 · on April 8, 2013

There are 2 pieces of C benchmarked. The other one computes `a[i]-b[i]` into a temporary array and then makes a single call to BLAS. The speed is roughly half that of Haskell, which dispenses with the space cost of the temporary array.

jules · on April 8, 2013

Yes, you're right, the other C code is even less efficient.

ky3 · on April 8, 2013

Good catch! It's the 3x BLAS that's 1/2 of Haskell's speed.

The 1x BLAS that uses 2x RAM gets shafted, presumably by cache lossage.

army · on April 8, 2013

Because benchmarks are really hard to do fairly. There's a great deal of subjectivity in what is a "fair comparison" regardless, since there's always a bit more tuning you can do.

That said, a lot of benchmarks are just completely bogus. It's surprisingly common to see someone claim that they're beating the "state of the art" when they're comparing against a weak baseline. One problem is that people don't understand the systems they're benchmarking sufficiently well to obtain good results.

In this case, I have no doubt the results are much better than they've obtained in Haskell before - I think they've done an impressive job and I don't doubt the value of the work. But if the question is whether rewriting your C code in Haskell would make it go faster... well... I wouldn't bet the house on it.

raylu · on April 7, 2013

Well, yes. Nobody is going to write a post titled "C beats Ruby" with reasonable implementations of both. If you make a post in the other direction, it's often the case that you did something horribly wrong.

Paradigma11 · on April 8, 2013

Selection bias. The problem is you only hear about surprising results. Nobody writes about a benchmark where c++ beats python. Now the reason for such a surprising result may very often be poorly written code in the contender language.

taeric · on April 7, 2013

"The C++ programmer must worry about the performance implications of abstraction."

This quote seems a touch odd when you consider just how much faster the C++ solution is. The Haskell programmer definitely does not get a complete pass on worrying about the implications of abstraction. Indeed, this seems to underscore that if performance is a major criteria, than language choice still matters.

GhotiFish · on April 7, 2013

I agree, auditing Haskell's performance characteristics is not easy. Then again, nothing in Haskell is easy.

ufo · on April 7, 2013

While its true that lazy evaluation makes it harder to think about performance (specially memory use), the flipside is that it makes it easier to reason about programs at a higher level and you have more freedom when defining abstractions.

For example, in Haskell you can freely rewrite common subexpression with a let whenever you want and in whatever order you want.

    --original
    (f x) ... (f x)

    --let
    let y = f x in y ... y

On the other hand, if your language has side effects these sort of transformations are not always safe.

plinkplonk · on April 7, 2013

having side effects and having lazy evaluation are distinct features though. One can conceive of a strictly evaluated haskell-like which isolates side effects in monads and is still largely composed of pure functions.

ufo · on April 7, 2013

> having side effects and having lazy evaluation are distinct features though

Fair enough, but lazyness also plays nicer with things like `if` and `&&` that comput things lazily. (Without lazyness you need to add lots of wrapper anonymous functions).

Another thins is that, from a hystorical point of view, the only reason Haskell managed to be so pure in the first place was the lazyness. If your language is not lazy its very tempting to add sideeffects.

tome · on April 7, 2013

One can conceive of it, and yet it's never been done. Curious.

colanderman · on April 7, 2013

Yes it has. Mercury is a perfect example of a purely functional language which is not lazy. I/O is made functional via a uniqueness typing system.

tome · on April 7, 2013

Right, so it uses uniqueness typing and not monads presumably.

nilkn · on April 7, 2013

There is Idris: http://idris-lang.org/

It seems similar to what plinkplonk is talking about.

tome · on April 7, 2013

Hmm, I always thought of Idris, Coq and Agda as proof assistants rather than programming languages, but perhaps that's just me being unimaginative.

cyrus_ · on April 7, 2013

Idris is explicitly billed as a programming language. You can do pretty sophisticated programming in Coq too (Coq can be reasonably straightforwardly compiled down to Ocaml, Haskell and some other languages.)

tome · on April 7, 2013

I'd better spend some time checking it out then!

gtani · on April 7, 2013

what about disciple and mu ?

http://stackoverflow.com/questions/10784124/non-lazy-branch-...

http://anil.recoil.org/papers/2011-cufp-scribe-preprint.pdf

tome · on April 7, 2013

I'm not really sure how "pure" Disciple can be considered to be, and it doesn't use monads does it? Perhaps I'm being ignorant though.

Mu is a great example. I wish there was more information about it publically available.

miga · on April 7, 2013

Writing short and clear code in Haskell is definitely easy. As for optimization - it quite often does it for you too!

GhotiFish · on April 7, 2013

!!!

I have never met a single person who has said this, let alone capable of following through.

If you are capable, I am envious of your determination/natural talent. Or alternatively I and every programmer I know are idiots. Actually both seem like possibilities.

nilkn · on April 7, 2013

> Or alternatively I and every programmer I know are idiots.

Most programmers haven't taken the time to actually learn Haskell.

Haskell only seems hard at first because it incorporates a lot of ideas that other languages do not. For a lot of developers, learning Haskell is a little bit like learning to program from scratch all over again.

In that sense, yes, I guess it's hard, but so was whatever language you learned first.

mbel · on April 7, 2013

> For a lot of developers, learning Haskell is a little bit like learning to program from scratch all over again.

I find this to be the greatest joy of learning a new paradigm, everything is new and unknown, so even simple things are fun (again).

miga · on April 8, 2013

> Haskell only seems hard at first because it incorporates a lot of ideas that other languages do not. Indeed, even if you know Haskell, it may take years to master all classes in its standard library and most commonly used modules (like Parsec.)

swift · on April 7, 2013

I have to agree with the GP here. Haskell is actually an exceptionally simple, easy language to work with, once you get the basic concepts down. I have occasionally suffered some confusion in Haskell because of poorly documented third-party libraries, but the language itself is a joy.

viraptor · on April 7, 2013

I'd add a qualifier that if you know what you're writing and you're writing from scratch, it's relatively easy. Modifying existing programs can present an interesting challenge.

For example once you have a working program and decide that actually some value could be read/written using a file instead of computed at runtime and the place you use it is 10 levels deep after calling a non-IO function, you've got two choices. Unsafe IO (then why not use it everywhere) or a minor program redesign. Haskell is the only language where I run into this kind of situation. (it's still quite simple despite this)

mightybyte · on April 8, 2013

I'd actually argue that modifying existing programs is where Haskell really shines. Strong static types and purity allow you to know a LOT more about arbitrary bits of code than any other language in mainstream use today.

tome · on April 7, 2013

If your program is well factored then it probably won't be hard to switch to IO ten levels deep. Bonus: Haskell makes refactoring easy!

viraptor · on April 9, 2013

Yes, but what happens to apps which depend on this piece of code? Suddenly the signatures change and you affect everyone with a change which should be completely transparent for them.

tome · on April 9, 2013

It's hard to discuss without an example.

ghc · on April 7, 2013

Haskell has a difficult learning curve, that is all. A person coming to Vim from a modeless editor might feel the same way, but nobody doubts that someone familiar with Vim finds it easy to use. It took me a couple of years to "get" Haskell, but once you do it is very easy write clear, concise code with.

gtani · on April 7, 2013

!!! I'm sure you're not an idiot.

Have you tried the School of Haskell, it incorporates the minimum viable skills for learning haskell (pulling code apart in the REPL, and understanding type signatures, pattern matches and compiler type check errors)

https://www.fpcomplete.com/school/how-to-use-the-school-of-h...

CJefferson · on April 7, 2013

While it is interesting to see generalised stream fusion, the conclusions are that Haskell is still half the speed of C++.

Also, looking at the results most of the performance comes from which library you are using, and how well it is optimised. No compiler is good enough to do what the good libraries do.

gngeal · on April 7, 2013

"No compiler is good enough to do what the good libraries do."

That's nonsense. And if you take a look, say, at NumPy, which uses heavily tuned BLAS/LAPACK code, and Numexpr, which uses the "compile the snippet and run it" approach, you'll see why. Libraries won't save your performance if you can't optimize across procedure and module boundaries.

ori_b · on April 7, 2013

Which is why C and C++ under most state of the art compilers support link time optimization.

gngeal · on April 8, 2013

That won't help you much, if you, say, need the fusion of matrix operations I mentioned.

miga · on April 7, 2013

Stream fusion is an optimization framework that is (to my knowledge) defined in the library, but common to most sequence types in Haskell.

andrewcooke · on April 7, 2013

i followed (and enjoyed) most of that, but i don't get (perhaps because i've not used haskell in a long time) how the consumer is automatically(?) chosen.

from what i understand of section 3.2 the consumer is critical in selecting the correct stream type from the bundle. who decides that? does the user have to select the appropriate function? or does it some how fall out naturally from the use case? or does the compiler select it?

mainland · on April 7, 2013

The library writer is the "consumer" here. The programmer just uses the library, and the library chooses the proper stream representation.

Of course the programmer can also use the lower-level stream interface directly if desired, but then the programmer must also know which stream representation to choose.

33a · on April 7, 2013

If Haskell beats C on that loop, then that is a compiler bug. Did they try repeating these experiments using say icc?

kg · on April 7, 2013

Personally, I would say that leaping straight to 'compiler bug' is a little unwarranted. If you really think it's a compiler bug, you should compile the code (they provided it) using the settings they used (I think they provided that too; you can see them in a comment in this thread, actually) and then look at the output assembly and find the bug and file a bug against the compiler.

lallysingh · on April 7, 2013

Or Haskell semantics allows more aggressive optimization than C.

icc's not actually that much better than gcc in optimization, it's primarily a much better runtime library that gives icc its speed boost.