
Zero Cost Abstractions - steveklabnik
https://boats.gitlab.io/blog/post/zero-cost-abstractions/
======
thenewwazoo
An interesting post. I have two thoughts.

The first is in response to:

> I think to some extent Rust has tried to cheat this by making non-zero cost
> abstractions actively unpleasant to write, making it feel like the zero cost
> abstraction is better.

This idea is closely related to a comment by estebank just yesterday[1]. I
disagree - I think Rust makes non-zero-cost abstractions _obvious_ to the
author. If you want to do something in a way that's costly, the language
surfaces that information. It puts it right in your face and makes you deal
with it. That's a _good thing_. It makes for better code. If you want to use
dynamic dispatch, you can't pretend you're not.

The second is to add a place where I think Rust has gotten zero cost
abstractions right: zero-sized generic types. I use them extensively in
embedded code to build things like states. I can use trait methods and
associated consts in _really_ powerful ways to encode complicated
relationships that constrain the operation of a peripheral at compile time,
and it compiles down to a single register access. That's crazy powerful. Using
C structs and type punning can do the same thing, but without the safety.
Using C++'s classes can give me some type safety, but at the cost of real
method call overhead. I get giddy every time I think about writing something
like

    
    
        struct Adc<PIN: AdcPin, MODE: AdcMode>(PIN, MODE);
        
        let adc = Adc(adc_pin, EightBit);
    

and the compiler will let me know if my `adc_pin` isn't the right type.

[1]
[https://news.ycombinator.com/item?id=19920945](https://news.ycombinator.com/item?id=19920945)
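The pattern sketched above can be made concrete. Here is a hedged, compilable sketch (all type names are invented for illustration) showing that the type-level markers occupy no space at runtime:

```rust
use core::marker::PhantomData;

// Hypothetical marker types standing in for a pin and a resolution mode.
struct Pa0;
struct EightBit;

// The pin is carried by value; the mode exists only at the type level.
#[allow(dead_code)]
struct Adc<PIN, MODE> {
    pin: PIN,
    _mode: PhantomData<MODE>,
}

impl<PIN> Adc<PIN, EightBit> {
    // Only pins in the right state can construct an 8-bit ADC.
    fn new(pin: PIN) -> Self {
        Adc { pin, _mode: PhantomData }
    }
}

fn main() {
    let adc = Adc::<Pa0, EightBit>::new(Pa0);
    // Both markers are zero-sized, so the whole Adc compiles away.
    assert_eq!(core::mem::size_of_val(&adc), 0);
}
```

Passing the wrong pin type fails at compile time, and the state machinery costs nothing at runtime.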

~~~
ridiculous_fish
Why would a C++ version involve method call overhead?

Wouldn't C++ actually be more straightforward with its value-typed template
parameters?

~~~
thenewwazoo
With C++, wouldn't I need to either use a virtual base class with a typed
template parameter (to emulate interfaces), or use a duck-typed parameter and
lose type safety?

~~~
gmueckl
Which of the two template arguments are you talking about? The enum-typed
template argument will lead to an optimized specialization of the template. C++
does, however, have an unfortunate problem with restricting type parameters to
types with certain properties/guarantees in ways that won't make the compiler
explode in obtuse error messages.

~~~
llukas
static_assert, enable_if, or just plain not providing a default implementation
and only specializations: as with everything in C++, error messages are
something you can customize to a high extent...

------
pron
There is one major problem with -- or, rather, cost to -- zero-cost
abstractions: they almost always introduce accidental complexity into the
programming language originating in the technical inner workings of the
compiler and how it generates code (we can say they do a bad job abstracting
the compiler), and more than that, they almost always make this accidental
complexity viral.

There is a theoretical reason for that. The optimal performance offered by
those constructs is almost always a form of specialization, AKA partial
evaluation: something that is known statically by the programmer to be true is
communicated to the compiler so that it can exploit that knowledge to generate
optimal machine code. But that static knowledge percolates through the call
stack, especially if the compiler wants to verify — usually through some form
of type-checking — that the assertion about that knowledge is, indeed, true.
If it is not verified, the compiler can generate incorrect code.

Here is an example from C++ (a contrived one):

Suppose we want to write a subroutine that computes a value based on two
arguments:

    
    
        enum kind { left, right };
     
        int foo(kind k, int x) { return k == left ? do_left(x) : do_right(x); }
    

And here are some use-sites:

    
    
        int bar(kind k) { return foo(k, random_int()); }
        int baz() { return foo(left, random_int()); }
        int qux() { return foo(random_kind(), random_int()); }
        
    

The branch on the kind in foo will represent some runtime cost that we deem to
be too expensive. To make that “zero cost”, we require the kind to be known
statically (and we assume that, indeed, this will be known statically in many
callsites). In this contrived example, the compiler will likely inline foo
into the caller and eliminate the branch when the caller is baz, and maybe in
bar, too, if it is inlined into its caller, but let’s assume the case is more
complicated, or we don’t trust the compiler, or that foo is in a shared
library, or that foo is a virtual method, so we specialize with a "zero-cost
abstraction":

    
    
        template<kind k> int foo(int x) { return k == left ? do_left(x) : do_right(x); }
    

This would immediately require us to change _all_ callsites. In the case of
baz we will call foo<left>, in qux we will need to introduce the runtime
branch, and in bar, we will need to propagate the zero-cost abstraction up the
stack by changing the signature to template<kind k> bar(), which will employ
the type system to enforce the zero-costness.

You see this pattern appear everywhere with these zero cost abstractions (e.g.
async/await, although in that case it’s not strictly necessary; after all, all
subroutines are compiled to state machines, as that is essential to the
operation of the callstack — otherwise returning to a caller would not be
possible, but this requires the compiler to know exactly how the callstack is
implemented on a particular platform, and that increases implementation
costs).

So a technical decision related to machine-code optimization now becomes part
of the code, and in a very intrusive manner, even though the abstraction — the
algorithm in foo — has not changed. This is the very definition of accidental
complexity. Doing that change at all use sites, all to support a local change,
in a large codebase is especially painful; it's impossible when foo, or bar,
is part of a public API, as it's a breaking change -- all due to some local
optimization. Even APIs become infected with accidental complexity, all thanks
to zero-cost abstractions!

What is the alternative? JIT compilation! But it has its own tradeoffs... A
JIT can perform much more aggressive specialization for several reasons: 1.
it can specialize speculatively and deoptimize if it was wrong; 2. it can
specialize across shared-library calls, as shared libraries are compiled only
to the intermediate representation, prior to JITting, and 3. it relies on a
size-limited dynamic code-cache, which prevents the code-size explosion we'd
get if we tried to specialize aggressively AOT; when the code cache fills, it
can decide to deoptimize low-priority routines. The speculative optimizations
performed by a JIT address the theoretical issue with specialization: a JIT
can perform a specialization even if it cannot decisively prove that the
information is, indeed, known statically (this is automatic partial
evaluation).

A JIT will, therefore, automatically specialize on a per-use-site basis: where
possible, it will elide the branch; where not, it will keep it. It will even
speculate: if at one use site (after inlining) it has so far only encountered
`left` it will elide the branch, and will deoptimize if later proven wrong (it
may need to introduce a guard, which, in this contrived example will negate
the cost of the branch, but in more complex cases it would be a win; also
there are ways to introduce cost-free guards -- e.g. by introducing reads from
special addresses that will cause segmentation faults if the guard trips, a
fault which is caught; OpenJDK's HotSpot does this for some kinds of guards).

For this reason, JITs also solve the trait problem on a per-use-site basis. A
callsite that in practice only ever encounters a particular implementation — a
monomorphic callsite — would become cost-free (by devirtualization and
inlining), and those that don’t — megamorphic callsites — won’t.

So a JIT can give us the same “cost-freedom” without changing the abstraction
and introducing accidental complexity. It, therefore, allows for more general
abstractions that _hide_ , rather than expose, accidental complexity. JITs
have many other benefits, such as allowing runtime debugging/tracing "at full
speed" but those are for a separate discussion.

Of course, a JIT comes with its own costs. For one, those automatic
optimizations, while more effective than those possible with AOT compilation,
are not deterministic — we cannot be sure that the JIT would actually perform
them. It adds a warmup time, which can be significant for short-lived
programs. It adds RAM overhead by making the runtime more complicated.
Finally, it consumes more energy.

There's a similar tradeoff for tracing GC vs. precise monitoring of ownership
and lifetime (Rust uses reference-counting GC, which is generally less
effective than tracing, in cases where ownership and lifetime are not
statically determined), but this comment is already too long.

All of these make JITs less suitable for domains that require absolute control
and better determinism (you won't get perfect determinism with Rust on most
platforms due to kernel/CPU effects, and not if you rely on its refcounting
GC, which is, indeed _more_ deterministic than tracing GC, but not entirely),
or are designed to run in RAM- and/or energy-constrained environments — the
precise domains that Rust targets. But in all other domains, I think that
cost-free abstractions are a pretty big disadvantage compared to a good JIT --
maybe not _every_ cost-free abstraction (some tracking of ownership/lifetime
can be helpful even when there is a good tracing GC), but certainly the
philosophy of always striving for them. A JIT replaces the zero-cost
abstraction philosophy, with a zero-cost _use_ philosophy -- where the _use_
of a very general abstraction can be made cost-free, it (usually) will be (or
close to it); when it isn't -- it won't, but the programmer doesn't need to
select a different abstraction for some technical, "accidental", reason. That
JITs so efficiently compile a much wider variety of languages than AOT
compilers can also demonstrate how well they abstract the compiler (or,
rather, the languages that employ them do).

So we have two radically different philosophies, both very well suited for
their respective problem domains, and neither is generally superior to the
other. Moreover, it's _usually_ easy to tell to which of these domains an
application belongs. That's why I like both C++/Rust _and_ Java.

~~~
kllrnohj
Or just use something like ThinLTO. You don't need a JIT just to get
cross-module inlining, which is all you're really talking about here. There's no
actual "specialization" in play, just basic inlining + dead branch
elimination. Trivial to do this statically at build time instead. So much so
this is even commonly done in JIT'd languages like Java (Proguard & R8 say
hello)

Most JITs also have the downside that they need to optimize for compilation
speed, and as such typically do not do as thorough a job optimizing as
something like offline LLVM. Or you end up with fourth-tier JITs like WebKit's
FTL. And now instead of just needing to wait for the JIT to think your
function is interesting you now need to wait for a second and finally third
JIT pass.

JITs really primarily just shine in the rapid iteration cycle. Conceivably
what you really want is a language where debug builds are JIT and release
builds are full AOT with LTO. Not entirely unlike Dart's behavior in Flutter.
Or even how Java behaves on Android to an extent. There doesn't seem to be any
particularly significant reason why this pairing isn't more broadly applicable
and used.

There's also no real reason you can't have zero-cost abstractions _AND_ a JIT
if you want. Hell, in theory this is what you get with C++ in a WebAssembly
environment. These are not exactly opposing technologies.

~~~
pron
> get cross-module inlining, which is all you're really talking about here.

Not at all.

> There's no actual "specialization" in play, just basic inlining + dead
> branch elimination.

Devirtualization + inlining + dead branch elimination _is_ an instance of
specialization, and an extremely effective one at that, but other forms of
specializations apply.

> Trivial to do this statically at build time instead.

No. Some things are undecidable and/or would lead to code explosion. E.g.,
it's often hard to statically prove that a subroutine that takes an int would
always receive the number 2, and specializing it for all ints would cause code
explosion (again, contrived example but with real-world counterparts).

> There's also no real reason you can't have zero-cost abstractions AND a JIT
> if you want. Hell, in theory this is what you get with C++ in a WebAssembly
> environment. These are not exactly opposing technologies.

Of course. But I was talking about two very different philosophies that
greatly affect language design.

~~~
kllrnohj
> Not at all.

For your example? Yes, it was.

> Devirtualization + inlining + dead branch elimination is an instance of
> specialization

Your example did not involve any devirtualization.

> No. Some things are undecidable and/or would lead to code explosion. E.g.,
> it's often hard to statically prove that a subroutine that takes an int
> would always receive the number 2, and specializing it for all ints would
> cause code explosion (again, contrived example but with real-world
> counterparts).

Again, your example did not have any such issues.

And if it was hard to statically prove a subroutine that takes an int always
receives 2 then guess what? A JIT wouldn't do it, either. They don't actually
specialize that much for runtime information because gathering that
information would destroy performance. Devirtualization? Sure, limited in
scope, and huge payoffs to be had for doing it. All arguments analyzed &
binned? Good god no. That'd be ludicrous in resource usage required, and the
payoffs would be incredibly hard to quantify in any meaningful way.

Take your example - all it would do is eliminate a single branch. If it was so
common to take a single avenue that it could be eliminated, then pretty much
every branch predictor would do it anyway. So... why specialize it?

~~~
orf
V8 uses runtime type/value information to specialize methods in some cases[1].
I find it hard to believe that HotSpot doesn’t either.

Obviously complex objects are impossible to handle, but branching on an
integer that’s always the same value isn’t. Same for constant folding, if an
argument is determined to be constant at runtime then that’s going to affect
the choices the JIT makes.

1\. [https://ponyfoo.com/articles/an-introduction-to-speculative-optimization-in-v8](https://ponyfoo.com/articles/an-introduction-to-speculative-optimization-in-v8)

~~~
kllrnohj
That's specializing on type, not on a given specific value. Being JS the value
influences the type, but it's only looking at a coarse binning (int vs. double
vs. string etc.) not specializing for a single specific value. A form of
devirtualization, if you will.

Hotspot wouldn't even need to bother with that most of the time since it's
primarily running statically typed languages for which that doesn't apply in
the first place.

------
JoeAltmaier
There are some fundamental optimizations we don't even think about any more.
E.g. the first class I wrote in C++ was a simple container for a byte, with a
constructor, a getter and a conversion to int. Then in main() I declared one
locally, constructed with 'true'. Then I returned that object as the return
code.

Imagine my surprise when the generated code was simply

    
    
        mov AX, 0xFF
        ret
    
Three or four bytes of code! So some abstractions, encapsulation for instance,
provide the user with guarantees but can cost absolutely nothing.

------
metafunctor
I really enjoyed the zero cost abstractions in Standard ML achieved by the
whole-program optimizing MLton compiler (and I have a t-shirt to prove it).
It's amazing how one can define clean, abstract, type-safe module hierarchies
that still boil down to very, very efficient machine code that's just inches
away from hand-written C code in terms of raw performance.

The zero runtime cost, of course, was paid in compilation time. With some
tricks, it wasn't a big problem in practice, though.

~~~
kristianp
Pardon my ignorance, but doesn't that family of languages suffer from the
overhead of boxing?

~~~
aeneasmackenzie
They can unbox sometimes, especially when compiling with MLton or Stalin.
There are research papers on specifying boxedness explicitly while staying
within an ML-ish framework, but no real implementations AFAIK.

------
xelxebar
Along these lines is an intermediate language called Morte[0], developed by
Gabriel Gonzalez, well known in the Haskell community. It seems to still be
getting some love.

The basic idea is to implement a strongly normalizing calculus of
constructions which languages can compile down to. That way, equivalent
functions provably generate equivalent code. Here's [1] a blog from 2014 that
goes into more detail on motivation, examples, and the like.

There's a proof-of-concept frontend called Annah[2]. I'm unaware of any
backends though.

[0]: [https://github.com/Gabriel439/Haskell-Morte-Library](https://github.com/Gabriel439/Haskell-Morte-Library)

[1]: [http://www.haskellforall.com/2014/09/morte-intermediate-language-for-super.html](http://www.haskellforall.com/2014/09/morte-intermediate-language-for-super.html)

[2]: [https://github.com/Gabriel439/Haskell-Annah-Library](https://github.com/Gabriel439/Haskell-Annah-Library)

------
vnorilo
"Zero cost" is a pet peeve of mine. While I concede that the concept is
useful, there is always a cost _somewhere_.

A very concrete example of this was a C++ unit test suite where templates were
used to generate many combinations of tasks/backends for verification. At one
point all that abstraction made our cloud build machines run out of memory. To
avoid upgrading the instances at a €€€ cost, I did some manual type erasure.

So, zero cost typically only concerns run time characteristics. Even then
there are elusive pitfalls, like generated code size.

~~~
paulddraper
AFAIK, no one has ever meant "zero cost" to mean "zero compile-time cost".

~~~
adrianratnapala
And yet it is called zero cost. Not "zero cost along one particular dimension
that we won't name and only sometimes matters."

~~~
steveklabnik
We took it from C++ which has a long history with the term; take it up with
Bjarne.

~~~
harry8
Sure he could. And he could take it up with anyone who copies it on a
greenfield project too, right? It's a crap name. So what; computing is full
of them, assembly language is full of them, C is full of them, C++ is full of
them. Sometimes we make a break and don't use one and invent something better.
Sometimes...

    
    
        mov [memory], %register
        memmove(dst, src, size);
        dst = std::move(src);
    

E.g. RAII and SFINAE are not great names to be copied; I'd be surprised if
Rust did.

------
jstimpfle
As I understand it, examples of "zero cost abstractions" would be std::string
or std::map<X,Y> in C++. Reimplementing those in plain C is not a great idea,
because the result will be very bug-prone, and will not be (significantly)
faster.

The problem with these "zero cost abstractions" is that we're measuring the
wrong cost.

The key to efficiency is optimizing _what_ is done, not optimizing each little
line of code. std::string is a terrible idea with regards to efficiency. If it
works for you chances are you should use Python instead. For anything complex,
std::string is slow and has a huge memory footprint (each string has its own
buffer, separately allocated from the heap). It also leads to a heavy increase
in compilation times. Anecdotally, it can even be painful to depend on it when
the standard library is upgraded. If you have anything beyond a few megabytes
worth of text data, std::string is about the worst possible technical
solution.

~~~
pdpi
zero-cost abstractions are abstractions where the hand-written code wouldn't
be any better than the abstraction.

An example is how Option<Box<T>> optimises down to a single pointer in Rust —
the compiler uses its knowledge that Box<T> is non-nullable, and co-opts the
null value to represent the Option's None variant.
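This is easy to observe directly with `size_of`:

```rust
use std::mem::size_of;

fn main() {
    // Box<T> can never be null, so Option uses the null bit pattern
    // for None: no extra tag word is needed.
    assert_eq!(size_of::<Option<Box<u32>>>(), size_of::<Box<u32>>());

    // For comparison, a type with no spare bit patterns needs a
    // separate discriminant, so its Option is bigger.
    assert!(size_of::<Option<u64>>() > size_of::<u64>());
}
```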

~~~
blt
Is this optimization required by the language semantics or some standard (vs.
being only a description of mainline rustc's implemented optimizations)?

~~~
steveklabnik
It’s a defined part of the language semantics.

~~~
eridius
Is there actually a formal language semantics definition anywhere? I thought
all we had was stuff like
[https://doc.rust-lang.org/reference/index.html](https://doc.rust-lang.org/reference/index.html),
which is explicitly listed as "not a formal spec".

~~~
steveklabnik
Things have different degrees of guarantees. This is something we’ve
explicitly said is guaranteed.

~~~
kurtisc
Where are things like this said/defined/collated?

~~~
steveklabnik
We are generally at a point where if it's documented, it's guaranteed. This
feature is really easy to describe, and so we have a high degree of confidence
that when we say "this behavior is defined", we won't run into an edge case
where we'd have to break it.

So, in this case, it's a documented part of Option.

I realize this answer isn't _fantastic_ , but it's where we're at now. It's
why we're working on the reference this year!

~~~
kurtisc
>we're working on the reference this year!

I didn't know this, it's good to hear. Thanks!

------
bjoli
The Lisp family of languages has macros, which can provide zero-cost
abstractions for you. For example, the loop facility is really just a macro,
and so are the Racket for loops.

Not only that: in Scheme there is usually a source->source optimisation phase,
which means you can easily inspect what an abstraction does. That way it is
pretty simple to verify that an abstraction really is zero cost.

My favourite examples are, as already mentioned, Racket's for loops and the
various Scheme pattern matchers out there. The difference between Lisp and
Rust is that these can be added to the language without having to change the
compiler, and they can be built as a library by anyone.
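For comparison, Rust's declarative macros can express some of the same library-level expansion; a minimal sketch (the `repeat!` macro is invented for illustration):

```rust
// A small looping construct defined as an ordinary library macro.
// The expansion happens entirely at compile time, so using the macro
// costs the same as writing the loop by hand.
macro_rules! repeat {
    ($n:expr, |$i:ident| $body:expr) => {
        for $i in 0..$n {
            $body;
        }
    };
}

fn main() {
    let mut sum = 0;
    // Expands to: for i in 0..5 { sum += i; }
    repeat!(5, |i| sum += i);
    assert_eq!(sum, 10); // 0 + 1 + 2 + 3 + 4
}
```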

------
truth_seeker
I very much like to see this feature implemented at (Compiler + JIT) level in
higher level dynamic languages like Clojure, Erlang, ES8 JS, Python etc.

------
alschwalm
There is a typo under the 'optimal performance' section. "A zero cost
_abstractoin_ ought..."

~~~
galonk
Also at the end it says "not non-zero cost". Assuming the double-negative is a
mistake, they should delete either the "not" or the "non-".

------
IloveHN84
Nice post, but some code examples would have helped to clarify the concepts
even further, especially for newbies.

------
choeger
Good luck with the trait object. I doubt you're going to significantly improve
on what the Haskell community has done in 20 years. If you do manage it
though, kudos in advance.

~~~
jadbox
Noob here: why is the trait object hard to optimize in principle?

~~~
uryga
(disclosure: i know Haskell but not Rust, and haven't actually implemented
anything like this in a compiler. please let me know if i'm wrong!)

best guesses:

\- dynamic dispatch usually involves carrying around¹ some function pointers
(to your type's implementations of the trait methods). this means that when
calling `x.foo()` on a trait object `x` of some trait `Foo`, the actual
function/method being called isn't known until runtime, so the compiler can't
optimize the call through things like inlining (which require knowing what's
being called). also AFAIK jumping to some unknown function pointer is
relatively slow on the CPU level

\- since the compiler doesn't know the actual type (and hence layout) of your
object, it has to be passed via reference, so there's some indirection
overhead and it's probably impossible to e.g. optimize away intermediate
values or be smart about the way they're kept in registers.

the problem is that the nature of trait objects (i.e. "I know nothing about
this value except that it implements the `Foo` trait") kind of seems to
require these things :(

¹ via a vtable or "dictionary passing"
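Both points can be seen in a small Rust sketch (trait and type names invented): the trait-object reference is a "fat" pointer that carries the vtable pointer alongside the data pointer:

```rust
trait Greet {
    fn greet(&self) -> &'static str;
}

struct En;
impl Greet for En {
    fn greet(&self) -> &'static str { "hello" }
}

// Static dispatch: monomorphized per concrete type, so the call target
// is known at compile time and can be inlined.
fn greet_static<G: Greet>(g: &G) -> &'static str {
    g.greet()
}

// Dynamic dispatch: one compiled body; the call goes through the
// vtable, so the target is only known at runtime.
fn greet_dyn(g: &dyn Greet) -> &'static str {
    g.greet()
}

fn main() {
    assert_eq!(greet_static(&En), "hello");
    assert_eq!(greet_dyn(&En), "hello");
    // The trait-object reference is twice the size of a plain
    // reference: data pointer + vtable pointer.
    assert_eq!(
        std::mem::size_of::<&dyn Greet>(),
        2 * std::mem::size_of::<&En>()
    );
}
```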

------
crimsonalucard
The perfect target candidate for zero-cost abstractions is a database API.

With databases as the bottleneck of web development, it is not optimal to rely
on a high-level API like SQL for everything. We resort to ANALYZE and strange
SQL hacks to trick the query engine into performing the right algorithms, when
the API should explicitly let us choose to use or tweak both zero-cost
abstractions and/or a query planner.

~~~
brianberns
I love SQL, but I strongly agree with this. It would be great to have a
platform-specific low-level API for performance-critical queries. (Much like
“unsafe” code in a managed programming language.)

~~~
jpgvm
I think the best analogy would be SQL is to OpenGL what this API would be to
Vulkan.

The ability to define the query procedurally, or at least semi-procedurally,
versus the declarative approach of SQL would allow direct optimisation of very
performance-critical queries.

~~~
GrayShade
I disagree. The amount of optimization that a query planner does is vastly
more than a web application developer would care for.

We have people who can't be bothered to close database transactions (see that
Uber report). I wouldn't trust them to choose the optimal join strategy in a
complex query.

~~~
mikeyhew
What Uber report?

~~~
GrayShade
[https://eng.uber.com/mysql-migration/](https://eng.uber.com/mysql-migration/)

------
revskill
So, it's zero cost from the program-performance standpoint. The user MUST pay
something to have it.

It's not zero-cost from user standpoint.

~~~
saagarjha
How is the user paying for this? Generally the cost ends up being longer
compiles and more development time.

~~~
withoutboats
The user is the programmer in this context, the user of the programming
language, not the end user of the program.

