

Perf for low level Haskell profiling - psibi
https://www.fpcomplete.com/user/bitonic/perf-for-low-level-profiling

======
CUViper
> mov instructions from register to register make up for more than 60% of the
> time spent in the critical section of the code, while we would expect most
> of the time to be spent xoring and anding. I have not investigated why this
> is the case, ideas welcome

If you're not using precise events, the instruction addresses reported by
perf will have some skid: a small CPU delay between a performance counter
overflowing and the interrupt actually freezing state. The sample then tends
to land on an instruction just after the one that really cost the cycles.

You can choose precise sampling for some events, depending on the CPU. Try
"perf record -e cycles:pp", for instance.

         0,09 │      mov    (%rax,%r8,4),%eax
        29,32 │      mov    %r14,%r8

I think this first mov from memory is likely to be your true cycle eater, much
more so than the second mov reg-reg or any single xor/and operations. But
don't optimize based on my hunch - measure it precisely first! If memory
access proves to be your slowdown, then you can try optimizing your access
patterns.

~~~
wyager
Also, because of how complicated modern pipelining is, some instructions that
you wouldn't expect to take a long time do take a long time (usually because
they're waiting on a mov from memory to finish). In this case, the mov from
memory could be throwing up a hazard that's blocking the mov from register.

------
th0br0
I wonder how many veteran C programmers (myself not included) would react with
"d'oh, of course you should byte-align memory access" here...

~~~
danieldk
I was the author of the _digest-pure_ package ('author' is a big word, since
it is an extremely small amount of code). I would like to mention that my goal
was to make simple, pure, reference implementations of crc32 and adler32 in
Haskell. The _digest_ package already provided performant implementations of
the same functions using zlib. In fact, it is pointed out in the description
of the _digest-pure_ package that you should use digest if you want
performance.

It turned out to be a useful exercise, since adler32 in digest gave incorrect
checksums, due to initialization to 0 rather than 1:

[https://github.com/jkff/digest/pull/1](https://github.com/jkff/digest/pull/1)
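
To make the initialization point concrete, here is a minimal sketch of Adler-32, written in Python rather than Haskell for brevity. The function name `adler32` and its `a_init` parameter are illustrative only and are not the API of _digest_ or _digest-pure_; the parameter exists just to show what changes when the first sum starts at 0 instead of 1.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32(data: bytes, a_init: int = 1) -> int:
    """Reference-style Adler-32. a_init is hypothetical, for demonstration."""
    a, b = a_init, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER  # running sum of bytes, seeded with a_init
        b = (b + a) % MOD_ADLER     # running sum of the successive a values
    return (b << 16) | a


msg = b"Wikipedia"
correct = adler32(msg)           # seeded with 1, as the algorithm specifies
wrong = adler32(msg, a_init=0)   # seeded with 0, the kind of bug described above

print(hex(correct), correct == zlib.adler32(msg))
print(hex(wrong), wrong == zlib.adler32(msg))
```

Seeding with 0 shifts both halves of the checksum, so every non-trivial input disagrees with zlib, which is why comparing against a second implementation catches it quickly.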

Anyway, thanks to Francesco Mazzoli for the really nice article!

