
Zero Cost Abstractions - Bognar
https://ruudvanasseldonk.com/2016/11/30/zero-cost-abstractions
======
mpweiher
>The only proper way to reason about the cost of these

>abstractions is to inspect the generated machine code.

To me, that's a big problem with a lot of these Heldencompilers. They _may_
generate really optimal machine code. Then again, they may not, and the
difference between optimizations working well and not working well in runtime
efficiency is so great (I've measured 1000x for Swift) that they might as well
be completely different languages.

For reference, 1000x means that 1 second turns into 16 minutes, and having
that type of difference in something that's completely opaque is not a useful
performance tool for me, because predictability is at least half the game in
performance. So something like Knuth's transformation systems that turn
optimization into a dialogue between programmer, compiler and instrumentation
seems like a better idea[1].

[1]
[https://www.cs.sjsu.edu/~mak/CS185C/KnuthStructuredProgrammi...](https://www.cs.sjsu.edu/~mak/CS185C/KnuthStructuredProgrammingGoTo.pdf)

~~~
colordrops
Heldencompiler? Google is not turning up much.

~~~
hunterwerlla
Probably supposed to be Heisencompiler as a play on Heisenberg's uncertainty
principle.

~~~
stcredzero
Is messing up branch prediction "Breaking Bad?"

~~~
Ygg2
No. See Heisenberg uncertainty principle.

------
etrain
This is pretty awesome. One key bit of information that the compiler has is
that the coefficients are a constant array of length 12, which makes the loop
unrolling possible and also means that the register magic is in play - it's
seriously awesome that the compiler does this.

That said, I'd expect something similar to happen with a well-written C
program. Would equivalent abstractions in C++1{1,4,7} be "costly"?

~~~
Manishearth
All of the optimizations were done by LLVM, so there's nothing stopping this
from working in C++ with closures, or in C with functions annotated for inlining.

With C the lack of generics means that writing composable iterators is hard,
though.
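To make the point concrete, here's a minimal sketch (the function name is made up, not from the article) of the kind of generic iterator chain in question. Each adaptor is an ordinary generic struct, so the whole pipeline monomorphizes into one concrete type that LLVM can inline and unroll:

```rust
// Hypothetical helper mirroring the article's inner loop: each adaptor
// (iter, zip, map, sum) is a generic struct, so this whole chain compiles
// down to one concrete, inlinable function.
fn predict(coefficients: &[i64], window: &[i32], qlp_shift: i16) -> i64 {
    coefficients.iter()
        .zip(window)
        .map(|(&c, &s)| c * s as i64)
        .sum::<i64>() >> qlp_shift
}
```

In C, each combination of element type and operation would need its own hand-written loop (or heavy macro use), which is what makes composable iterators awkward there.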

~~~
adrianN
At least it's hard to write composable iterators that don't circumvent the
type system completely.

------
a_c
His snippet is very close to what one would do in Scala. And I suspect it is
very similar to other functional languages. I have always wanted to have a
comparison between the generated code of different functional languages for a
similar program. One day..

~~~
k__
Yes, Rust feels a bit like a mix of Scala and C. Especially match :)

~~~
hyperpape
I think ML is actually the strongest influence. Of course, Scala and Haskell
both have ML influence, so there are things that look familiar from all three
languages.

------
eridal
OT and not sure if you're the author.. but what a beautifully designed
website!

~~~
aban
Not the author (Ruud), but I've been in touch with him.

The source code for his website is free software [0], so feel free to have a
look or adopt it for yourself.

Ruud uses a small static site generator he's written in Haskell, and in his
own words it “includes a tiny templating engine, an html and css minifier, and
an aggressive font subsetter.”

[0]: [https://github.com/ruuda/blog](https://github.com/ruuda/blog)

------
rawnlq
> But in any case, a missed opportunity for vectorisation is just that: a
> missed opportunity. It is not abstraction overhead.

Does the same code implemented in C manage to vectorize? If so, isn't that an
actual "cost" in comparison?

------
sjolsen
How difficult is it to add a "zero cost abstraction" like the ones used here?
For example, what would it take to make this compile:

    
    
        for window in buffer.sliding_window(coefficients.len()) {
            let prediction = coefficients.iter()
                                         .zip(window)
                                         .map(|(&c, &s)| c * s as i64)
                                         .sum::<i64>() >> qlp_shift;
            let delta = buffer[i];
            buffer[i] = prediction as i32 + delta;
        }
    

with sliding_window returning an iterator over slices of the buffer?

~~~
jstimpfle
"Zero cost abstraction" to me means a completely "static" abstraction, i.e.
there are no runtime mechanisms _necessary_ to implement the abstraction.

Your sliding window is just a series of windows, so there is no reason why it
couldn't be compiled in the "most" efficient way. In fact it probably is;
just try it out by implementing sliding_window as an iterator.

However, there's always this problem with cleverness: it gets harder and harder
to read and maintain. See also
[http://wiki.c2.com/?SufficientlySmartCompiler](http://wiki.c2.com/?SufficientlySmartCompiler)
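For what it's worth, the standard library already has something close for slices: `windows(n)` yields overlapping subslices. A rough read-only sketch (hypothetical function name; it deliberately sidesteps the in-place update of the original loop, since mutating the buffer while windows borrow it would be rejected by the borrow checker, so one would reach for index arithmetic or `split_at_mut` there):

```rust
// Collect one prediction per sliding window; read-only, so the
// borrow checker is happy with `windows` borrowing the buffer.
fn predictions(buffer: &[i32], coefficients: &[i64], qlp_shift: i16) -> Vec<i64> {
    buffer.windows(coefficients.len())
          .map(|window| {
              coefficients.iter()
                          .zip(window)
                          .map(|(&c, &s)| c * s as i64)
                          .sum::<i64>() >> qlp_shift
          })
          .collect()
}
```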

------
pron
> Fortunately these structures are not allocated on the heap, as they would
> likely be in a language like Java or Python.

At least in principle, escape analysis would be used to allocate them on the
stack. In HotSpot, simple iterators are commonly allocated on the stack (and
then optimized further). When you have a JIT (which means you can afford one),
general abstractions become zero-cost based on their use-site. The upside is
that you get a few, general and powerful abstractions that are then completely
optimized based on how they're used. The downside is that it is not
guaranteed, and a JIT requires more RAM and power (and in practice usually a
warmup period, although that can be solved).

------
forrestthewoods
I'm not familiar with some of these operations. I don't know what a Rust
"slice" is. Or what "Zip" does.

Could someone show me the most straight forward equivalent in vanilla C? I
assume there is no direct equivalent as temporary storage will be needed. But
that's fine and would further serve the purpose of explaining why Rust is
cool.

Thanks.

~~~
thenewwazoo
Uh, the C equivalent might be something like

    
    
        uint32_t *buffer = ...;
        uint64_t coefficients[12] = {...};
        uint16_t qlp_shift = ...;
    
        uint32_t *bufp = &buffer[...];
        uint64_t sum;
        for (size_t i = 0; i < 12; i++)
            sum += coefficients[i] * bufp[i-12];
        uint64_t prediction = sum >> qlp_shift;
        *bufp += (uint32_t)prediction;
    

Edit: this is incorrect

~~~
Manishearth
This is incorrect. The C code is more like

    
    
        for (size_t i = 0; i < 12; i++) {
            sum = 0;
            for (size_t j = 0; j < 12; j++)
                sum += coefficients[j] * bufp[i + j];
            buffer[i + 12] += sum >> qlp_shift;
        }

~~~
nkurz
I think this is still a little off. Maybe you used bufp[i + j] where you meant
buffer[i + j]? I may have something wrong too, but I probably would go with:

    
    
        for (size_t i = 0; i < 12; i++) {
            int64_t sum = 0;
            int32_t *bufp = buffer + i;
            for (size_t j = 0; j < 12; j++)
                sum += coefficients[j] * bufp[j];
            *bufp += (int32_t)(sum >> qlp_shift);
        }
    

I think this will put it in a form where the compiler should vectorize it[1],
probably making the loop about twice as fast as the version shown: 3 vector
loads, 3 vector multiplications, 3 vector additions and a horizontal sum
versus 16 scalar loads[2], 12 scalar multiplications and 12 scalar additions.

[1] I presume the "various reasons the above snippet is not as obvious to
vectorise as it might seem at first sight" is because the coefficients[] are
64-bit and buffer[] is 32-bit. But a smart compiler should be able to use
PMOVSXDQ (packed move sign extend double-word to quad-word) here. Unless there
is some other issue, I think the author may be underestimating the benefit of
vectorizing here.

[2] While the author is right that most of coefficients stay in registers,
there isn't actually room for all 12, so 4 of them are loaded from the stack
every iteration. Since only 3 128-bit vector registers are required, they
should remain there. I'm not sure if the excess loads are actually a
bottleneck, though.

~~~
openasocket
> I think this will put it in a form where the compiler should vectorize it

I played around with your snippet using the compiler explorer
[https://godbolt.org/](https://godbolt.org/) . I tried a couple versions of
clang, gcc, and icc and none of those were able to auto-vectorize it. Not sure
why. Kind of a shame, since there would definitely be a speedup.

~~~
nkurz
Thanks for testing it. You inspired me to try as well, and I've found that
Intel ICC 16 and 17 will vectorize it if you put "#pragma vector always"
before the inner loop. Without it, it thinks that the write to buffer[i] might
interfere with the reads of buffer[i + j]. I think this is because it's trying
to "jam" the inner and outer loop.

In addition, I discovered a couple other issues. I was embarrassed to see that
I'd messed up the loop bounds in my example. But the bigger one is that AVX2
and earlier do not have a 64-bit x 64-bit -> 64-bit vector multiplication
instruction. VPMULLQ was only added with AVX512, possibly because it requires
a 512-bit intermediate.

This means that with 64-bit coefficients, the vector approach has to do the
multiplications twice, and shift, and add. This is likely going to be slower
than the scalar approach. But if you were able to switch to 32-bit (or
floating point) coefficients, I think the vectorized solution looks pretty
good. ICC estimates a 1.4x speedup, but I think that gets better with longer
buffers.

I put the version with int32_t coefficients up here:
[https://godbolt.org/g/GXNVe2](https://godbolt.org/g/GXNVe2)

~~~
Ruud-v-A
Actually, the coefficients are only 16 bit, they are just widened before the
loop. AVX2 and later are not widely supported, but there is pmuldq in SSE4.1
which is appropriate here. It takes 32 bit operands and produces a 64 bit
result. Because its result is 64 bit (as it should be by the way, because
truncating to 32 bit _will_ overflow on real-world FLAC files) it can only do
two elements per operation. A few adds can be vectorised too then, but you
still need a horizontal add to get the final value. I think it is possible to
get a small speedup, but definitely not the 4× or 8× that you might hope for
with SSE/AVX. But at this point I would say that this is a pretty advanced
optimisation, not your average "vectorise all operations in the loop,
increment the counter by 4/8 at a time" transformation that a compiler can do
easily. I am impressed that the Intel compiler can actually vectorise things
here.

------
fche
Would be curious to see a good C++ translation of that, which a modern
compiler can unroll/inline about as well.

~~~
kibwen
Given that the Rust compiler uses LLVM as its backend, and given that Rust's
closure implementation was inspired by C++'s (though with extra compile-time
machinery to make them memory-safe), I'm certain that C++ is just as capable
of boiling these abstractions away as Rust is. :) (But admittedly I don't know
if iterator adaptors like map and zip are in C++'s stdlib.)

~~~
Karliss
Not part of stdlib, but it can be done using Boost.
[https://godbolt.org/g/QBN5zP](https://godbolt.org/g/QBN5zP)

------
junke
I don't know anything about FLAC, but is the implicit modular arithmetic the
expected behavior here? What if a product or a sum overflows?

~~~
Ruud-v-A
Good question! It turns out that 64 bit arithmetic does not overflow (I
explain this in a comment in the source code [1]) up to the shift. The shift
amount must then be large enough to fit the result in 32 (or actually, 24 or
usually 16) bits. For a valid FLAC file this will be the case. For an invalid
FLAC file it might truncate, but an invalid file cannot be decoded properly
anyway. The shift amount is a 5 bit signed number [2], so it is never possible
to shift by the integer width or more.

[1]:
[https://github.com/ruuda/claxon/blob/91b6af9/src/subframe.rs...](https://github.com/ruuda/claxon/blob/91b6af9/src/subframe.rs#L476-L481)
[2]:
[https://github.com/ruuda/claxon/blob/91b6af9/src/subframe.rs...](https://github.com/ruuda/claxon/blob/91b6af9/src/subframe.rs#L576-L579)
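As a rough back-of-the-envelope version of that argument (using the widths mentioned in this thread: 16-bit coefficients, an order-12 predictor, samples within 32 bits; the precise reasoning lives in the source comment linked above):

```rust
// Worst-case magnitude of the accumulated sum before the shift.
// |coefficient| < 2^15, |sample| <= 2^31, and there are 12 terms,
// so the sum stays below 12 * 2^46 < 2^50, well inside i64 range.
fn worst_case_sum_bound() -> i64 {
    let max_coef: i64 = 1 << 15;   // coefficients are 16-bit signed
    let max_sample: i64 = 1 << 31; // samples fit in 32 bits (usually 24 or fewer)
    let n_terms: i64 = 12;         // order-12 predictor
    max_coef * max_sample * n_terms
}
```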

~~~
junke
Thanks for the detailed answer.

------
dfrey
Why no constant for the value 12? :(

------
kutkloon7
I like the idea of zero-cost abstractions very much. I think that eventually
we will move to formally verified code - which is, of course, also a zero-
cost abstraction, since the verification is normally done at compile-
time.

The snippet presented is completely unreadable to me though, and I think that
in general, Rust is too hard to understand (and has some syntax which seems
quite arbitrary).

~~~
Manishearth
> The snippet presented is completely unreadable to me though

I suspect this is just a matter of being used to things. For me, the
corresponding imperative code is harder to disentangle. Whereas, knowing what
zip/map/sum do, both the intent and the behavior of the code are abundantly
clear to me.

~~~
khedoros1
I suspect that's right. I can describe what zip, map, and sum do, but I don't
have an intuitive feel for what each one does or the patterns that they're
usually used in because I haven't ever really done any functional programming.

I've seen side-by-side imperative and functional comparisons, written in Rust.
Reading through the functional version of the code always gave me a very rough
idea of what was happening, and reading the imperative version was always
immediately clear to me. I think that it's almost certainly a function of
familiarity.
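For instance, a made-up dot-product pair (not from any particular article) showing the two styles side by side; with optimizations on, both often compile to essentially the same machine code:

```rust
// Functional style: compose iterator adaptors.
fn dot_functional(xs: &[i64], ys: &[i64]) -> i64 {
    xs.iter().zip(ys).map(|(&x, &y)| x * y).sum()
}

// Imperative style: explicit index arithmetic.
fn dot_imperative(xs: &[i64], ys: &[i64]) -> i64 {
    let mut sum = 0;
    for i in 0..xs.len().min(ys.len()) {
        sum += xs[i] * ys[i];
    }
    sum
}
```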

~~~
yazaddaruvala
Have you ever learned a new spoken language? At first, the learner understands
by translating to their primary language. Only later, with repeated experience,
does the learner stop doing so and become "fluent".

It's very similar when learning the functional style of programming. Even
today, with most of this functional style being intuitive to me, when I see a
complex iterator chain (usually only when constructing a `Map`), I still need
to mentally unfold it into a loop.

~~~
khedoros1
> Have you ever learned a new spoken language? At first, the learner
> understands by translating to primary language.

That seems like a good analogy. I underwent that process a few years ago while
teaching myself to read assembly, and in the past, other spoken languages.

