
Optimising Haskell for a tight inner loop - luu
http://neilmitchell.blogspot.co.uk/2014/01/optimising-haskell-for-tight-inner-loop.html
======
evmar
Ironically, the C++ code he's attempting to match itself uses a higher-level
language[1] to generate a soup of tables and gotos[2]. But I think that
observation supports the point he'd likely make: Haskell allows both the
higher-level expression of problems, which produces optimization
opportunities, and the option of dropping down to low-level trickiness if
that's what you really need.

[1]
[https://github.com/martine/ninja/blob/master/src/lexer.in.cc...](https://github.com/martine/ninja/blob/master/src/lexer.in.cc#L123)

[2]
[https://github.com/martine/ninja/blob/master/src/lexer.cc#L1...](https://github.com/martine/ninja/blob/master/src/lexer.cc#L130)

------
carterschonwald
I may take this blog post and turn it into a bug report for the ghc optimizer.
:-)

The moment you start trying to compete against hand written assembly, any high
level lang gets a bit quirky. One nice thing about ghc Haskell is you can
though! A lot of my own open source work is towards a vision of making that
much easier than it is today :-)

~~~
tenslisi
> The moment you start trying to compete against hand written assembly, any
> high level lang gets a bit quirky.

Yes, which is why I wonder why GHC went with its own code generator instead
of using LLVM? It seems like a lot of the output assembly could be coalesced
by LLVM's optimizer passes. Is there work on this problem within the Haskell
world?

~~~
gnuvince
> Yes, which is why I wonder why GHC went with its own code generator instead
> of using LLVM?

LLVM didn't exist when GHC was started.

~~~
thirsteh
It's pretty amazing: The initial release of GHC was in _1989_.

------
comex
Note that since all of the characters in question are less than 64, you could
also do a bit test between (1 << x) and a 64-bit mask, simplifying the control
flow. I'm not sure whether this would actually be faster, assuming the post is
correct that other characters that happen to be <= '$' are rare in practice.
But you could then load several bytes at a time into SSE registers and do the
test in parallel, probably netting a substantial speedup at the cost of
becoming architecture-dependent.
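A minimal C sketch of that bit-test idea (the delimiter set below is illustrative, not necessarily the post's exact set):

```c
#include <stdint.h>

/* Membership test for a character class whose members are all < 64:
   one shift and one AND against a precomputed 64-bit mask.
   The set used here (' ', '$', ':', '\n', '\r', '\0') is illustrative. */
static const uint64_t delim_mask =
    (1ULL << ' ') | (1ULL << '$') | (1ULL << ':') |
    (1ULL << '\n') | (1ULL << '\r') | 1ULL /* '\0' */;

static int is_delim(unsigned char c)
{
    /* Characters >= 64 are never in the set, so guard the shift. */
    return c < 64 && ((delim_mask >> c) & 1);
}
```

This replaces the chain of comparisons with a single branchless membership test per byte.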

------
awda
By break0, it looks like we're just writing Haskell-ified C. (Lots of ugly,
similar safety guarantees.)

~~~
thirsteh
Indeed. Many of the high-performance libraries like bytestring, aeson, etc.
are written like this.

Haskell's performance is very good for nearly everything I write--and I don't
write anything nearly that low level--but I can't deny that this is one of the
uglier sides of the language.

On the other hand: in most other languages you'd just be dropping down to the
C ABI (that's easy to do in Haskell, too, for that matter). It is kind of
impressive that you can write code that is as fast as C without writing C,
even if it is ugly.

~~~
1amzave
> _It is kind of impressive that you can write code that is as fast as C
> without writing C, even if it is ugly._

Impressive, perhaps -- but from a maintainability perspective I think I might
prefer to pay the cost of jumping the language barrier to be able to write
natural, comprehensible C rather than staying within one language and dealing
with an unreadable jumble, even if it performs as well.

~~~
thu
The final Haskell code, while low-level, is not an unreadable jumble. Maybe
you're talking about the generated Core or the C-- code.

------
userbinator
I wonder how a table lookup compares? Something like

     mov edx, @table
     xor eax, eax
    loop:
     mov al, [ecx]
     inc ecx
     cmp byte [edx+eax], 0
     jnz loop

I really wish Intel would optimise lodsb, scasb, and the rest of the string
instructions, since they're powerful and particularly suited for this sort of
scanning.
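For reference, a rough C rendering of that table-lookup loop (the table contents here are illustrative; the real lexer's character class would differ):

```c
#include <stddef.h>
#include <string.h>

/* 256-entry table: nonzero means "keep scanning", zero means "stop".
   Mirrors the asm above: cmp byte [edx+eax], 0 / jnz loop. */
static unsigned char keep_going[256];

static void init_table(void)
{
    memset(keep_going, 0, sizeof keep_going);
    for (int c = 'a'; c <= 'z'; c++)   /* illustrative class: lowercase letters */
        keep_going[c] = 1;
}

/* Returns the length of the run of "keep going" characters at p.
   Assumes p is NUL-terminated ('\0' maps to a zero table entry). */
static size_t scan(const unsigned char *p)
{
    const unsigned char *start = p;
    while (keep_going[*p])
        p++;
    return (size_t)(p - start);
}
```

One load and one compare per byte, with the whole character-class decision folded into the table.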

~~~
zvrba
I think it's more likely that they'll eventually re-purpose these opcodes
(further ISA extensions), and string operations will be optimized through
simple hardware matching of documented instruction sequences. Already today,
Intel's optimization manual gives SSE-optimized string search routines, which
consume 16 bytes at a time.
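A sketch of that 16-bytes-at-a-time style using SSE2 intrinsics (this is my own illustration of the idea, not a routine from Intel's manual; `__builtin_ctz` is GCC/Clang-specific):

```c
#include <emmintrin.h>
#include <stddef.h>

/* Find the first occurrence of `target` in p[0..n), 16 bytes per step.
   Returns n if not found. All loads stay within bounds, so no padding
   is assumed; the tail is handled one byte at a time. */
static size_t find_byte(const unsigned char *p, size_t n, unsigned char target)
{
    __m128i t = _mm_set1_epi8((char)target);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));
        /* One bit per byte position that matched. */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, t));
        if (mask)
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
    for (; i < n; i++)          /* scalar tail */
        if (p[i] == target)
            return i;
    return n;
}
```

The compare-and-movemask pattern is the core of most SSE string scans: 16 byte comparisons collapse into one integer test per iteration.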

Removing and re-purposing instructions is not unprecedented; it was already
done at the 32->64 bit jump. (Prefixes, decimal instructions, etc.)

~~~
pbsd
I doubt those instructions (especially stos* and movs*, respectively matching
memset and memcpy) are going anywhere. Intel has even changed their semantics
in recent chips to work on 16-byte or larger chunks, and to have out-of-order
memory accesses. Current MOVSB, for instance, should be nearly as fast as
SSE/AVX optimized memory transfers.

~~~
userbinator
I wouldn't say their semantics have changed, but certainly their
implementation has. The fact that Intel used to recommend against using them
but now advises the opposite suggests that scas/cmps (memchr/memcmp) could be
next. In any case, the thought of one machine instruction doing better than
lots of carefully hand-written assembly is nice.

------
jheriko
I know it's not necessarily avoidable in practice due to legacy reasons, but
my main takeaway from this is that Haskell is not for high-performance code.

