
On Competing with C Using Haskell (2017) - signa11
https://two-wrongs.com/on-competing-with-c-using-haskell.html
======
vvanders
> So when people say "C-like performance" in this sort of context, what they
> really mean is "the performance if we choose to make a foreign call to a
> function written in C instead of writing the same function in Haskell".

That's not really why we use C (or C++ or Rust) for high-performance code. The
author got really close by calling out branch-prediction overhead and reading
1 MB from memory.

However, he completely missed the overhead of a cache miss. We use C because
you can be _explicit_ about your memory layout, so that when you read memory in
a predictable pattern (linearly forward/backward) the prefetcher can keep ahead
of you and avoid cache misses. You're looking at 400-1200 cycles per cache
miss, which can happen on every reference access if you're not careful.
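To put the layout point in Haskell terms: assuming the vector package, an unboxed vector stores its elements contiguously like a C array, so a linear scan is exactly the prefetcher-friendly pattern described here, while a plain list is a chain of pointers to boxed values:

```haskell
import qualified Data.Vector.Unboxed as VU

-- Contiguous storage: a VU.Vector Double is one flat buffer of Doubles,
-- so this scan walks memory linearly and the prefetcher can keep up.
sumContiguous :: VU.Vector Double -> Double
sumContiguous = VU.sum

-- A [Double], by contrast, is a linked chain of cons cells pointing at
-- boxed Doubles; every element is a potential cache miss.
sumChased :: [Double] -> Double
sumChased = sum

main :: IO ()
main = print (sumContiguous (VU.enumFromN 1 1000))  -- prints 500500.0
```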

I don't know how rich Haskell's support for value vs. reference types and
memory layout is, but this is something you can't just wave a compiler at,
since it's highly dependent on your dataset and use cases.

~~~
tathougies
> I don't know how rich Haskell's support for Value vs Reference types and
> memory layout are but this is something you can't just wave a compiler at
> since it's highly dependent on your dataset and use cases.

Haskell has both value types (unboxed types) and reference types; lazy
reference types are the default. If you add a '!' before the type of a field
in a record, you get a strict field: a direct pointer to an evaluated value
rather than a thunk. If you add '{-# UNPACK #-} !' or enable the
'-funbox-strict-fields' option, you get directly embedded value types.
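A minimal sketch of the three field flavors (the record and field names are made up for illustration):

```haskell
-- Illustrative record showing the three field flavors described above.
data Fields = Fields
  { lazyField     :: Double                  -- default: pointer, possibly to an unevaluated thunk
  , strictField   :: !Double                 -- '!': evaluated at construction, still a boxed pointer
  , unpackedField :: {-# UNPACK #-} !Double  -- the raw Double is stored inline in the record
  }

main :: IO ()
main = print (lazyField f + strictField f + unpackedField f)  -- prints 6.0
  where
    f = Fields 1 2 3
```

Compiling with '-funbox-strict-fields' gives every '!' field the UNPACK treatment without per-field pragmas.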

That being said, C has more first-class support for low-level manipulation,
but if you wanted to, you could just add this kind of manipulation onto
Haskell as a library. In my last company, we developed a library that used
higher-kinded polymorphism and phantom types to basically make low-level
memory manipulation a 'first class' value that could be reasoned about using
Haskell's type system.

However, this required that you annotate your types appropriately. The code
would then end up compiling into highly efficient direct memory and register
manipulation, while still having all of Haskell's typical type safety.
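As a hypothetical sketch of the idea (not the actual library described above): a phantom type parameter can tie a raw byte offset to both the structure it indexes and the type of the field at that offset, so misuse becomes a type error:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr, castPtr, plusPtr)
import Foreign.Storable (Storable, peek, poke)

data SomeStruct  -- phantom tag for an imagined C struct

-- A byte offset of a field of type 'a' inside a struct of type 's'.
-- The offsets themselves are just Ints; the phantoms do the policing.
newtype Field s a = Field Int

readField :: Storable a => Ptr s -> Field s a -> IO a
readField p (Field off) = peek (p `plusPtr` off)

main :: IO ()
main = alloca $ \(buf :: Ptr Double) -> do
  poke buf 42.0
  -- Pretend 'buf' is a struct; 'Field 0' says a Double lives at offset 0.
  v <- readField (castPtr buf :: Ptr SomeStruct)
                 (Field 0 :: Field SomeStruct Double)
  print v  -- prints 42.0
```

Reading the same offset at the wrong field type, or through a pointer to a different struct, is rejected at compile time rather than corrupting memory at run time.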

~~~
vvanders
Interesting, thanks for the detailed breakdown.

How's support for directly mapping chunks of memory to packed structures? We
used to do a lot of in-place loading in games (yay for spinning, physical
disks), so being able to do blind casts was almost as important as value
types themselves.

~~~
tathougies
Haskell allows unsafe pointer arithmetic and manipulation (of course, the
functions are all named 'unsafe'!). But you can write safe interfaces on top
of it. I have an example of an external sort using the standard Fibonacci
merge algorithm, which uses raw pointers and memory-mapped IO under the hood
but exposes a safe API. I've done some basic benchmarking, and it's about as
fast as Postgres's sort on my machine, for simple binary doubles and ints.

Here it is:
[https://github.com/tathougies/extsort/blob/master/Data/Exter...](https://github.com/tathougies/extsort/blob/master/Data/External/Sort.hs)

I think it's a good example of 'low-level' Haskell.
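For the in-place loading question specifically, here is a minimal sketch of a "blind cast" via Foreign.Storable; the 8-byte layout is assumed for illustration, and the bytes are read native-endian:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Data.Word (Word8, Word32)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable

-- An assumed on-disk layout: two native-endian Word32s, 8 bytes total.
data Header = Header { magic :: Word32, count :: Word32 }

instance Storable Header where
  sizeOf    _ = 8
  alignment _ = 4
  peek p = Header <$> peekByteOff p 0 <*> peekByteOff p 4
  poke p (Header m c) = pokeByteOff p 0 m >> pokeByteOff p 4 c

main :: IO ()
main = allocaBytes 8 $ \(buf :: Ptr Word8) -> do
  -- Pretend this buffer came straight from mmap or a raw disk read.
  pokeByteOff buf 0 (0xCAFEF00D :: Word32)
  pokeByteOff buf 4 (7          :: Word32)
  hdr <- peek (castPtr buf :: Ptr Header)
  print (count hdr)  -- prints 7
```

The cast itself is just `castPtr` plus a `peek`; the Storable instance is where the packed layout lives.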

------
tathougies
In his explicit-recursion example, the author questions why the compiler
doesn't convert it into a tail call. The issue here is that the speed he
gained in the tail-call version had nothing to do with the tail call. It has
to do with the fact that the accumulating parameter is strict. The explicitly
recursive version is lazy, which means a thunk structure of (1 + 1 + ...) is
built up in memory and then evaluated. The overhead of building up this
thunk and then evaluating it (awful cache characteristics) is what he's
seeing. By adding a bit of strictness, the author could likely get the
optimizer to convert it into a strict tail call.
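A minimal sketch of the difference (most visible at -O0; GHC's strictness analyser often rescues the lazy version at -O2):

```haskell
{-# LANGUAGE BangPatterns #-}

-- Lazy accumulator: 'acc + x' is never forced during the loop, so a chain
-- of (+) thunks builds up in memory and is only collapsed at the end.
sumLazy :: [Int] -> Int
sumLazy = go 0
  where
    go acc []     = acc
    go acc (x:xs) = go (acc + x) xs

-- Strict accumulator: the bang forces the running total at each step,
-- which is what lets GHC compile this into a tight register loop.
sumStrict :: [Int] -> Int
sumStrict = go 0
  where
    go !acc []     = acc
    go !acc (x:xs) = go (acc + x) xs

main :: IO ()
main = print (sumStrict [1 .. 1000000])  -- prints 500000500000
```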

~~~
gowld
[https://wiki.haskell.org/Foldr_Foldl_Foldl%27](https://wiki.haskell.org/Foldr_Foldl_Foldl%27)
explains this in detail.

------
glangdale
I am not convinced that advancing over two different string buffers and
comparing their contents - even with a variable amount of advance due to
character encoding - would have a branch miss in it if programmed by a
competent programmer in any language.

I am also not convinced that this problem can't be completely clobbered by
SIMD, regardless of platform (x86 vs Neon) and language (Haskell, C,
whatever).

A comparison of languages in this case for performance (outside the kiddy
division) would either (a) center around which language could get the hell out
of the way and let the programmer use intrinsics or (b) which language could
actually allow the programmer to express SIMD concepts in a nice language-
friendly way.

I have occasionally seen some pretty cool stuff from FP folks in the latter
direction, which, if brought to production quality, would allow FP people to
absolutely slaughter naive C code.

~~~
tathougies
There is a Summer of Code project this year to make GHC output SIMD code from
standard Haskell code. The LLVM backend does this now, but it comes at the
cost of being less efficient on normal Haskell code, and it is also unable to
access all the rich type information that GHC theoretically could.

Nevertheless, GHC has all SIMD intrinsics available for use, and they're all
type-checked, as you'd expect:
[https://hackage.haskell.org/package/ghc-prim-0.5.2.0/docs/GH...](https://hackage.haskell.org/package/ghc-prim-0.5.2.0/docs/GHC-Prim.html#g:29)
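A small sketch of what those typed primops look like in use (needs MagicHash and UnboxedTuples, and the LLVM backend via -fllvm to compile to actual SIMD instructions):

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts

-- Adds two 4-lane float vectors with the typed FloatX4# primops.
-- Lane count and element type are checked at compile time.
addX4 :: (Float, Float, Float, Float) -> (Float, Float, Float, Float)
      -> (Float, Float, Float, Float)
addX4 (F# a, F# b, F# c, F# d) (F# e, F# f, F# g, F# h) =
  case unpackFloatX4# (plusFloatX4# (packFloatX4# (# a, b, c, d #))
                                    (packFloatX4# (# e, f, g, h #))) of
    (# x, y, z, w #) -> (F# x, F# y, F# z, F# w)

main :: IO ()
main = print (addX4 (1, 2, 3, 4) (10, 20, 30, 40))
```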

------
toolslive
In OCaml, there's the option to directly call the native function without any
setup cost or overhead; in fact, some of the standard library's functions are
offered this way. For example:

    
    
        external sqrt : float -> float = "caml_sqrt_float"
         "sqrt" [@@unboxed] [@@noalloc]

~~~
anonymouz
The overhead being talked about in the blog post is mostly the impossibility
of inlining/optimizing the C code of the callee, setting up the stack frame,
etc.

How would OCaml avoid that?

~~~
toolslive
It does a reasonable job:

    
    
        external my_sqrt : float -> float = "caml_sqrt_float"
           "sqrt" [@@unboxed] [@@noalloc]
    
        let () =
          let x = my_sqrt 1.0 in
          Printf.printf "x = %f" x
    

becomes something like:

    
    
        ...
        camlInline__entry:
                .cfi_startproc
                subq    $8, %rsp
                .cfi_adjust_cfa_offset 8
        .L102:
                movsd   .L103(%rip), %xmm0
                sqrtsd  %xmm0, %xmm0
                movsd   %xmm0, (%rsp)
                movq    camlInline__6@GOTPCREL(%rip), %rbx
                movq    camlPervasives@GOTPCREL(%rip), %rax
                movq    208(%rax), %rax
                call    camlPrintf__fprintf_1291@PLT
        ....

------
ncmncm
Competing with C is a pretty low bar nowadays, although I appreciate the
uniquely careful definition used here.

When you are serious about performance nowadays, cache misses and branch
prediction are decreasingly relevant. You need to talk about drawing on
computational resources practically inaccessible from C, including AVXn vector
units and shaders.

C's weakness for expressing abstractions is a fundamental and unbridgeable
handicap for such computation. C++ might be up to the job, and Haskell should
be. It is not clear whether Rust is. For this kind of hardware exploitation,
direct compiler support will always lag far behind. You need the language to
be expressive enough to support writing libraries that wrap access to the
platform's hardware features in a way that lets the library be ported to
other hardware without changing client code, and that presents its
capabilities in a way natural for the language.

Compiler optimizers are good at merging client code with library
implementation when both are in the same language, and can identify
efficiencies not available to library or clients taken in isolation.

It is typically not less work to write such a library than to tailor a
compiler, but the work has better longevity: other people port the library to
their hardware, and all the compilers and programs are brought along.

In the case of rustc and GHC, where there is at the moment only one compiler,
it may be harder to see this.

