
Iterators and Streams in Rust and Haskell - psibi
https://www.fpcomplete.com/blog/2017/07/iterators-streams-rust-haskell
======
iainmerrick
I feel like the author felt obliged to include the full results, which is
noble, but they mostly obscure the interesting ones.

What does it matter if the "cheating" versions are faster, since they're doing
something completely different? (OK, in principle it could be the same with an
unrealistically magical optimizer.)

Seems to me the key point is that a bunch of high-level constructs in both
Rust and Haskell are very nearly as fast as a tight loop in C. That's great!

The versions that are much slower don't seem very surprising, as they involve
boxing the ints. _(Edit to add: OK, reading more closely, I guess 'Haskell
iterator 5' is interesting to dig into.)_

~~~
ozataman
No idea why people here have fixated on the "cheating" versions - it's clear
to me they were included mainly to set a maximal-speed baseline/benchmark and
are not the main point of the article.

~~~
runeks
I find the "cheating" versions peculiar because I don't see their purpose.
What's the point, and in what way is it cheating? It's just a different
algorithm, and it doesn't add any useful information to the subject at hand.

~~~
chriswarbo
Numerical operations in a loop are often subject to aggressive optimisation by
C compilers, which makes them tricky to use in benchmarks: are we measuring
the intended loop, or has the work been optimised away? Often comparisons are
made of "Blub vs C", where the C result is an order of magnitude smaller, and
it's not clear if that's because C is fast or whether it's been optimised
away.

Including an "optimised away" version lets us know when this has happened: the
"non-cheating" benchmarks take much longer than the "cheating" ones, so we can
assume they've not been optimised away.

I assume the author only went into detail about them because they're
independently interesting, regardless of the main topic of the post.

~~~
runeks
The solution to this is to deliver arguments at runtime, rather than baking
them into the program as constants. Describe some computation by a data
structure that is delivered at runtime, and see which implementation does
best. That way there can be no cheating.

~~~
chriswarbo
> The solution to this is to deliver arguments at runtime, rather than baking
> them into the program as constants.

Functions already take their arguments at runtime. Except when they don't, due
to optimisation.

For a benchmark to be automated, reproducible, etc. those constants have to be
baked in _somewhere_, even if it's in a Haskell program using FFI (as in the
article), or a shell script, etc. Whilst optimisers don't (yet) cross the
language/process boundary, it still makes sense to include such sanity checks,
rather than assuming we know what the optimiser will/won't do.

After all, the whole point of a benchmark is to gather evidence to
question/ground the assumptions of our mental model. The less we assume, the
better. The more evidence we gather, the better.

------
runeks
I find it really interesting that the idiomatic Haskell implementations
(basically math) are the best-performing, while the Rust-like Haskell
implementation (using an IORef) is orders of magnitude slower. This is exactly
what I want: describe the logic of the operation, and leave the compiler to
optimize for the hardware (in this case a CPU which mutates registers). The
Rust implementation makes assumptions about the underlying hardware (has
registers we can mutate), and is only about 15% faster than the Haskell
implementation which makes no such assumptions.

In essence, this is why I love Haskell, and choose it over Rust: it allows me
to write my application logic directly, without having to think about it in
terms of mutation, and have the generated code be pretty fast. If GHC becomes
well-optimized enough it can render Rust obsolete, since "no runtime overhead"
becomes pretty meaningless if it's actually slower than Haskell (e.g. using
LinearTypes, which removes the need for GC). Rust can't render Haskell
obsolete, however, since Haskell's goal is basically allowing you to write
logic
however, since Haskell's goal is basically allowing you to write logic
directly, using types as propositions and values as proofs. So Haskell's goal
is a qualitative one (execute logic) while Rust's is a quantitative one
(performance/no runtime overhead), which results in Haskell being able to take
the place of Rust if GHC gains sufficiently in performance.

~~~
adwn
> _If GHC becomes well-optimized enough it can render Rust obsolete [...]_

Ah yes, the mythical Sufficiently Smart Compiler (TM), scheduled to arrive any
day now. Will it also be able to reliably transform non-trivial functions that
are accidentally O(n) in memory consumption due to Haskell's laziness into
O(1) versions?

The keyword here is _reliably_. There are many clever but brittle compiler
optimizations that see little use in the field, because when performance is
crucial, you can't rely on an optimization that the compiler applies in some
cases but not in others, so you code it explicitly.

~~~
chongli
_reliably transform non-trivial functions that are accidentally O(n) in memory
consumption due to Haskell's laziness, to O(1) versions?_

What you're describing is called strictness analysis. [0] I believe that
attempting to solve it in general runs up against the halting problem. That
being said, GHC does perform some strictness analysis when optimization flags
are turned on. [1] The language itself also includes strictness annotations on
the fields of data constructors. [2] Experienced Haskell programmers know how
to use these to avoid allocating unnecessary thunks.

[0]
[https://en.wikipedia.org/wiki/Strictness_analysis](https://en.wikipedia.org/wiki/Strictness_analysis)

[1]
[https://wiki.haskell.org/Performance/Strictness#Strictness_a...](https://wiki.haskell.org/Performance/Strictness#Strictness_analysis)

[2]
[https://wiki.haskell.org/Performance/Data_types#Strict_field...](https://wiki.haskell.org/Performance/Data_types#Strict_fields)

------
pklausler
> But look again: C is taking 87 nanoseconds, while Rust and Haskell both take
> about 175 microseconds. It turns out that GCC is able to optimize this into
> a downward-counting loop, which drastically improves the performance. We can
> do similar things in Rust and Haskell to get down to nanosecond-level
> performance, but that's not our goal today. I do have to say: well done GCC.

Downward-counting or not, it is simply impossible for GCC to generate code
that executes all 1,000,000 iterations of the loop in 87ns. That would be 87
femtoseconds per iteration, on average.

More likely, GCC figured out how to collapse the entire loop into a closed-
form expression that is a function of the loop length.
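
For the simplest case (summing 1..n) the closed form is n(n+1)/2, which a
compiler can evaluate in constant time. A quick C illustration (the article's
actual loop body may differ):

```c
#include <assert.h>

/* The loop a benchmark would time: O(n) additions. */
long sum_loop(long n)
{
    long total = 0;
    for (long i = 1; i <= n; i++)
        total += i;
    return total;
}

/* The closed form a compiler might derive instead: O(1), no loop. */
long sum_closed(long n)
{
    return n * (n + 1) / 2;
}
```

With the closed form, the "loop" takes the same few instructions whether n is
ten or a million - which would explain a nanosecond-scale result.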

~~~
gizmo686
I just tested this on my machine (gcc 5.4.0). At -O2, gcc produced
normal-looking assembly code. At -O3, gcc produced a monstrosity [0] that I
don't feel like fully deciphering.

However, from a brief glance, it does not appear to have created a closed-form
solution. Instead, it contains a single loop:

        .L4:
            addl    $1, %edx
            paddd   %xmm1, %xmm0
            paddd   %xmm2, %xmm1
            cmpl    %edx, %eax
            ja  .L4

which seems to be using a SIMD instruction (paddd [1]) that does 4 32-bit
integer additions in parallel.

After this loop, it does some "housekeeping" (read: something I don't
understand) before proceeding to an unrolled version of the last iterations of
the loop:

        leal    4(%rdx), %ecx
        addl    %edx, %eax
        cmpl    %ecx, %edi
        jl  .L2
        addl    %ecx, %eax
        leal    8(%rdx), %ecx
        cmpl    %ecx, %edi
        ...
        jl  .L2
        addl    %ecx, %eax
        leal    28(%rdx), %ecx
        cmpl    %ecx, %edi
        jl  .L2
        addl    %ecx, %eax
        addl    $32, %edx
        leal    (%rax,%rdx), %ecx
        cmpl    %edx, %edi
        cmovge  %ecx, %eax
        ret

Where .L2 is just:

        .L2:
            rep ret

I assume that this is just some form of return, but the documentation I could
find [2] seems to suggest that rep is a prefix for string operations, which
doesn't make sense.

[0][https://pastebin.com/raw/Y55gQG7p](https://pastebin.com/raw/Y55gQG7p)

[1]
[http://x86.renejeschke.de/html/file_module_x86_id_226.html](http://x86.renejeschke.de/html/file_module_x86_id_226.html)

[2]
[https://c9x.me/x86/html/file_module_x86_id_279.html](https://c9x.me/x86/html/file_module_x86_id_279.html)
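
One way to read the vectorized loop: each SIMD lane accumulates every fourth
term, the lane totals are combined afterwards, and a scalar tail handles
leftover iterations. A scalar C sketch of that shape (an illustrative reading,
not GCC's literal output):

```c
#include <assert.h>

/* Scalar model of a 4-lane vectorized sum of 1..n: each "lane"
 * accumulates every fourth term (roughly what paddd does across the
 * xmm registers), the lanes are summed, and a scalar tail handles
 * terms left over when n isn't a multiple of 4. */
long strided_sum(long n)
{
    long lane[4] = {0, 0, 0, 0};
    long i = 1;
    for (; i + 3 <= n; i += 4)          /* "vector" body: 4 adds per step */
        for (int l = 0; l < 4; l++)
            lane[l] += i + l;
    long total = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i <= n; i++)                 /* scalar epilogue */
        total += i;
    return total;
}
```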

~~~
gpderetta
The rep in rep ret is effectively ignored; it's a two-byte return that works
around a branch-predictor limitation on older AMD processors. The
'housekeeping' code handles loop counts that aren't a multiple of 8.

Still, unless I'm missing something, the code should be executing 8 adds per
clock [2]; at 4 GHz, that's still above 1µs for 500k adds.

GCC doesn't seem to be able to fold the loop given a constant argument unless
the function is explicitly declared constexpr; in that case it will complain
about the accumulator overflowing, but GCC still doesn't take advantage of the
constant.

Clang does not vectorize the loop but will replace it with a constant given a
constant parameter.

Bottom line, I'm not sure what's going on with the article's measurements.

[2] potentially 12 for Skylake, or even 24 with AVX.
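
The arithmetic behind that bound, as a sketch (8 adds per cycle and 4 GHz are
assumptions, not figures from the article):

```c
#include <assert.h>

/* Lower bound on runtime in nanoseconds: total adds divided by adds
 * retired per cycle gives cycles; dividing by the clock rate in GHz
 * (i.e. cycles per nanosecond) gives nanoseconds. */
double lower_bound_ns(double adds, double adds_per_cycle, double ghz)
{
    double cycles = adds / adds_per_cycle;
    return cycles / ghz;
}
```

Even under these generous assumptions, a million adds need tens of
microseconds, so an 87 ns figure can't correspond to the loop actually
running.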

