Before you say 'oh, Haskell is still only half the speed of C++', keep in mind just how gnarly the C++ code they had to write looks. The Eigen code is pretty nasty, and getting it correct required some fairly specialized understanding.
In comparison, the performance they're getting out of Haskell is coming from comparatively mundane, easy to understand code. That's pretty impressive.
Or another way of putting it: It looks like naive Haskell outperforms naive C++, and naive Haskell is not that much slower than carefully-tuned C++.
While 'Derived::Scalar' could very well be replaced by 'double' in that case, you need the templated function to catch the temporary type (expression template) that represents the u-v computation. This lets Eigen compute u[i]-v[i] and the dot product inside the same inner loop.
Without the templated function, the compiler would first generate the subtraction into a temporary vector, and then compute the dot product on that vector.
edit: I see, they address this in the article:

> In this specific case the Eigen library has the squared L2-norm in its API, but in more complicated examples there might be no such built-in provision
If I understand the paper correctly (and apply the knowledge to Eigen), it's because of the same reason: you want to allow different stream fusion strategies to work.
The code (actual instructions) that beats C++ in their example is being generated by their experimental "stream fusion" framework. And NOT by a generic Haskell compiler.
And now the question: do you really want to use an unproven and unstable Haskell framework when there are stable alternatives (like Eigen, Theano, etc.) available?
So no. Haskell (production ready Haskell) doesn't even come close.
Although the results are impressive, the title is a bit optimistic. Haskell beat GCC slightly in the first benchmark. Haskell beat C in the second benchmark because the C code is very poorly written. They did not compare with the obvious C implementation. Secondly, GCC isn't the best C compiler, even though the paper claims that it is the best compiler that "we could find". They should have used ICC. C++ is also 4x faster than Haskell in the second benchmark, giving further indication that the C is slow just because it is poorly written. So it's hard to justify such a title. The work itself is very good however.
> C++ is also 4x faster than Haskell in the second benchmark, giving further indication that the C is slow just because it is poorly written.
That would imply that the C version of the C++ code would be just as fast, if only it were not "poorly written".
However there are some types of code (including this kind) where having C++ available to generate the code for you makes it almost inherently faster than C.
You could certainly reimplement specific matrix operations for very specific types in C, but the C++ Eigen library does this all for you at compile time, and does it the best way for each combination of types and operations.
If you do this in C, even with macros everywhere, it's going to be much more difficult, especially since you can't (AFAIK) declare and discard temporary types within macros, though C11 at least allows you to refer generically to types within macros. The canonical example for this kind of disparity is C's qsort() vs. C++ templated sorts.
This is kind of the point to the paper. Performance of this sort is only possible in C++ if you're careful to follow the instructions provided with Eigen about using templated functions, and some very smart devs had to write Eigen in the first place. The Haskell code comes fairly close starting from the source code you'd write naïvely, and without tons of optimization on the stream fusion code generator.
The first C example is pretty disingenuous, as they explicitly added loop unrolling and prefetching tuned to x86-64 to the Vector library, then compared against basic C versions with "-funroll-loops" (and no prefetching, based on the icc/gcc/clang compiler output I looked at). For some reason they also use "-msse4.2" on a CPU supporting AVX, while the Haskell version is generating AVX instructions.
The paper would have been better off with more complex examples (e.g. functions that can't be trivially implemented in asm by hand with no register pressure) and fewer comparisons to implementations that could be doing what they are, but aren't.
First author here. The dot product example was compiled with GCC 4.7.2 -O3 -msse4.2 -ffast-math -ftree-vectorize -funroll-loops; see the caption to Figure 5. What compiler options would you have suggested? The Haskell version only used SSE instructions, not AVX; this should have been made clear in the paper.
The more complex examples are in Section 5.2; see Figure 8. Granted, we would have liked to have done more, but deadlines are deadlines...
Figure 7 shows Haskell-generated AVX instructions, albeit only using the lower 128 bits. That code would not run on an SSE4.2-capable Nehalem, for instance.
There are some other CPU-related slight inaccuracies in the paper. Prefetching is repeatedly mentioned, even though its effect is negligible when one has a perfectly linear memory access pattern; unaligned loads are mentioned as a performance hit, but they are essentially free on the test processor (2600k, Sandy Bridge).
Matrix multiplication would perhaps be a better example to show the power of clever prefetching.
The Haskell assembly is using the 3-operand AVX instructions, but not 256-bit registers. The example is simple enough that it may not make a difference (2-operand vs. 3-operand instructions), but in a more complex function with register spills it could have. ("-mavx" should be enough.)
I would be surprised if prefetching helped at all, given that the memory access pattern is completely linear. But yes, it's always a problem that the work in a paper is tuned to the particular machine the benchmark is run on, while the things that are compared against are not. Also agreed that it's suspicious that Haskell is using AVX while the C isn't. However compared to the second benchmark, the first one is very fair ;-)
Perhaps somebody with an AVX capable processor can redo the two benchmarks with correct compiler options (i.e. -mavx) and/or with ICC, and replacing the C code of the second benchmark with code that is equivalent to the Haskell code:
double s=0;
for(int i=0; i<n; i++) s += pow(a[i]-b[i],2);
I found out gcc supports -fprefetch-loop-arrays, although it is not guaranteed to have a positive effect. In this case, it does appear to run faster than without prefetching. AVX versions also run faster than the standard SSE counterparts even with 128 bit registers. ymm regs are faster than xmm.
gcc is faster than icc actually, and icc runs faster with the plain C version than the vectorized versions. I can't figure out how to make icc prefetch either; the switch that controls it doesn't seem to have an effect.
GHC and GCC are two of the most widely used compilers for their respective languages. The research uses the most popular compilers, which are used by many different projects, not the most micro-optimized ones.
I am curious why every time there is a "language" beats "other language" in some test posted on here, there is an inevitable slew of comments along the lines of "losing language was poorly written code".
Is the probability really so high as to be 100% that all optimization tests were against a poorly written competitor? That's amazing if that's the case.
It seems to me that most X beats Y posts are written by people with a great deal of knowledge of X, but without a matching knowledge of Y. This, unfortunately, leaves every post wide open to this sort of attack.
I would wager a large contributor to this is that the idioms of some of these specialized fields do not generalize to other use cases.
So, while some researchers have gotten it such that a relatively straightforward implementation in a new language with new tools is better than the old way, the folks working the old way have not exactly stagnated.
This seems especially likely when you consider that for many of the really fast implementations, it is not that they are using no abstractions. The abstractions they are using are very close to the machine, because the machine is the abstraction they are using. (That make sense?)
I suspect this comment is at least partially directed at me. Did you read the paper? Excluding some O(1) operations, the goal of the second benchmark is this:
double s = 0;
for(int i=0; i<n; i++) s += pow(a[i]-b[i],2);
return s;
But they did not use this C code. The C code they used made 3 calls to the BLAS library. The end result is that the C code is doing something equivalent to this:
double x=0, y=0, z=0;
for(int i=0; i<n; i++) x += a[i]*a[i];
for(int i=0; i<n; i++) y += a[i]*b[i];
for(int i=0; i<n; i++) z += b[i]*b[i];
return x - 2*y + z;
While this does return the same result (assuming infinite precision arithmetic) it is obviously not the way anybody would do it, since it's doing 3 times the work. Even worse, because their code is calling into the BLAS library for each loop, the compiler is explicitly prevented from optimizing the three loops and memory accesses by combining them into one. Note that the Haskell code is doing the efficient single loop. So yes, it is poorly written code.
There are 2 pieces of C benchmarked. The other one computes `a[i]-b[i]` into a temporary array and then makes a single call to BLAS. The speed is roughly half that of Haskell, which dispenses with the space cost of the temporary array.
Because benchmarks are really hard to do fairly. There's a great deal of subjectivity in what is a "fair comparison" regardless, since there's always a bit more tuning you can do.
That said, a lot of benchmarks are just completely bogus.
It's surprisingly common to see someone claim that they're beating the "state of the art" when they're comparing against a weak baseline. One problem is that people don't understand the systems they're benchmarking sufficiently well to obtain good results.
In this case, I have no doubt the results are much better than they've obtained in Haskell before - I think they've done an impressive job and I don't doubt the value of the work. But if the question is whether rewriting your C code in Haskell would make it go faster... well... I wouldn't bet the house on it.
Well, yes. Nobody is going to write a post titled "C beats Ruby" with reasonable implementations of both. If you make a post in the other direction, it's often the case that you did something horribly wrong.
Selection bias.
The problem is you only hear about surprising results. Nobody writes about a benchmark where C++ beats Python. Now the reason for such a surprising result may very often be poorly written code in the contender language.
"The C++ programmer must worry about the performance implications of abstraction."
This quote seems a touch odd when you consider just how much faster the C++ solution is. The Haskell programmer definitely does not get a complete pass on worrying about the implications of abstraction. Indeed, this seems to underscore that if performance is a major criterion, then language choice still matters.
While it's true that lazy evaluation makes it harder to think about performance (especially memory use), the flip side is that it makes it easier to reason about programs at a higher level, and you have more freedom when defining abstractions.
For example, in Haskell you can freely rewrite common subexpressions with a let whenever you want and in whatever order you want.
--original
(f x) ... (f x)
--let
let y = f x in y ... y
On the other hand, if your language has side effects these sort of transformations are not always safe.
Having side effects and having lazy evaluation are distinct features, though. One can conceive of a strictly evaluated Haskell-like language which isolates side effects in monads and is still largely composed of pure functions.
> having side effects and having lazy evaluation are distinct features though
Fair enough, but laziness also plays nicer with things like `if` and `&&` that compute things lazily. (Without laziness you need to add lots of wrapper anonymous functions.)
Another thing is that, from a historical point of view, the only reason Haskell managed to stay so pure in the first place was laziness. If your language is not lazy, it's very tempting to add side effects.
Idris is explicitly billed as a programming language. You can do pretty sophisticated programming in Coq too (Coq can be reasonably straightforwardly compiled down to Ocaml, Haskell and some other languages.)
I have never met a single person who has said this, let alone capable of following through.
If you are capable, I am envious of your determination/natural talent. Or alternatively I and every programmer I know are idiots. Actually both seem like possibilities.
> Or alternatively I and every programmer I know are idiots.
Most programmers haven't taken the time to actually learn Haskell.
Haskell only seems hard at first because it incorporates a lot of ideas that other languages do not. For a lot of developers, learning Haskell is a little bit like learning to program from scratch all over again.
In that sense, yes, I guess it's hard, but so was whatever language you learned first.
> Haskell only seems hard at first because it incorporates a lot of ideas that other languages do not.
Indeed, even if you know Haskell, it may take years to master all classes in its standard library and most commonly used modules (like Parsec.)
I have to agree with the GP here. Haskell is actually an exceptionally simple, easy language to work with, once you get the basic concepts down. I have occasionally suffered some confusion in Haskell because of poorly documented third-party libraries, but the language itself is a joy.
I'd add a qualifier that if you know what you're writing and you're writing from scratch, it's relatively easy. Modifying existing programs can present an interesting challenge.
For example, once you have a working program and decide that actually some value could be read/written using a file instead of computed at runtime, and the place you use it is 10 levels deep inside a non-IO function, you've got two choices: unsafe IO (and then why not use it everywhere?) or a minor program redesign. Haskell is the only language where I run into this kind of situation. (It's still quite simple despite this.)
I'd actually argue that modifying existing programs is where Haskell really shines. Strong static types and purity allow you to know a LOT more about arbitrary bits of code than any other language in mainstream use today.
Yes, but what happens to apps which depend on this piece of code? Suddenly the signatures change and you affect everyone with a change which should be completely transparent for them.
Haskell has a steep learning curve, that is all. A person coming to Vim from a modeless editor might feel the same way, but nobody doubts that someone familiar with Vim finds it easy to use. It took me a couple of years to "get" Haskell, but once you do, it is very easy to write clear, concise code with.
Have you tried the School of Haskell? It covers the minimum viable skills for learning Haskell (pulling code apart in the REPL, and understanding type signatures, pattern matches, and compiler type-check errors).
While it is interesting to see generalised stream fusion, the conclusions are that Haskell is still half the speed of C++.
Also, looking at the results most of the performance comes from which library you are using, and how well it is optimised. No compiler is good enough to do what the good libraries do.
"No compiler is good enough to do what the good libraries do."
That's nonsense. And if you take a look, say, at NumPy, which uses heavily tuned BLAS/LAPACK code, and Numexpr, which uses the "compile the snippet and run it" approach, you'll see why. Libraries won't save your performance if you can't optimize across procedure and module boundaries.
I followed (and enjoyed) most of that, but I don't get (perhaps because I've not used Haskell in a long time) how the consumer is automatically(?) chosen.
From what I understand of Section 3.2, the consumer is critical in selecting the correct stream type from the bundle. Who decides that? Does the user have to select the appropriate function? Or does it somehow fall out naturally from the use case? Or does the compiler select it?
The library writer is the "consumer" here. The programmer just uses the library, and the library chooses the proper stream representation.
Of course the programmer can also use the lower-level stream interface directly if desired, but then the programmer must also know which stream representation to choose.
Personally, I would say that leaping straight to 'compiler bug' is a little unwarranted. If you really think it's a compiler bug, you should compile the code (they provided it) using the settings they used (I think they provided that too; you can see them in a comment in this thread, actually) and then look at the output assembly and find the bug and file a bug against the compiler.
Why is it a template function? Why does it return Derived::Scalar instead of double? Dear god, Eigen is magic.