
Compiler Optimizations are Awesome - turingbook
https://blog.regehr.org/archives/1515
======
jcranmer
Several years ago, I happened on a blog post where someone presented a
hand-written, SSE-vectorized N-queens solver as being very fast. I managed to
write a faster, non-vectorized C solution that was recursive, that the
compiler couldn't vectorize, and that was much easier to understand than the
vectorized original it was based on.

Turns out that the main reason the vaunted "overkilled" solution was so abysmally
slow was that the author happily used BSF and BTC in the hottest part of the
loop... which are actually rather slow instructions, particularly when you're
using them to control a branch (compare-and-jump is a fused µop in practice,
but BTC-and-jump is not).

The point of this tale is that if you want to absolutely wring the last clock
cycle out of a hot path, you usually need good microarchitectural knowledge
about which operations are going to be faster and which are not. Sure, you can
beat a compiler with hand-written, hand-optimized assembly code most of the
time--but the people who have the skills to write such code are going to be
the people working on the compilers.

The tools for optimizing compilers are getting better, and probably faster
than we are capable of pumping out performance engineers to hand-craft the
inner loops. In the past decade, we've seen polyhedral loop transformations
become production-quality. Auto-vectorization is getting better, particularly
when user-directed (think #pragma openmp simd); I know Intel has been pushing
"outer-loop vectorization" very hard in the past few years. The other big
fruit on the horizon is superoptimizers: I suspect we'll see superoptimizers
shipping in production compilers within a decade or two.

~~~
gcp
_the author happily used BSF and BTC in the hottest part of the loop... which
are actually rather slow instructions_

Depends on the hardware you're running on. They used to be very fast on Intel
machines. (There's an argument here that C survives architecture changes
better than the ASM alternative, but BSF/BSR don't have C equivalents that the
compiler can translate if appropriate, AFAIK. The best generic, widely fast
implementation I know of uses multiply-shift with de Bruijn sequences and
isn't automatically translated for obvious reasons)
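
For what it's worth, here is a minimal sketch of the multiply-shift / de Bruijn
trick mentioned above, in C (the constant and lookup table are the classic
32-bit "Bit Twiddling Hacks" values; v must be non-zero):

    
      /* Index of the lowest set bit of a 32-bit value, without BSF/TZCNT. */
      #include <stdint.h>
    
      static const int debruijn_pos[32] = {
           0,  1, 28,  2, 29, 14, 24,  3,
          30, 22, 20, 15, 25, 17,  4,  8,
          31, 27, 13, 23, 21, 19, 16,  7,
          26, 12, 18,  6, 11,  5, 10,  9
      };
    
      int lowest_set_bit(uint32_t v)
      {
          /* Isolate the lowest set bit, then multiply by a de Bruijn constant
             so the top five bits uniquely identify the bit's position. */
          return debruijn_pos[(uint32_t)((v & -v) * 0x077CB531u) >> 27];
      }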

~~~
bluGill
That is the point: I can spend a couple of months hand-optimizing a hot spot,
but for what CPU? The optimizations I'd do for an Intel Core i7 are different
from those for an Intel Core i9 - and those are generations of the same
architecture. Try to support AMD Ryzen and things are very different. Now add
in an ARM CPU (which one?)...

If I can find a better algorithm I can of course beat the compiler on all of
the above. However, does such an algorithm exist?

------
scraft
Is anyone else in games development here? If we are looking to run the game at
60 FPS, we have 16.67 ms per frame to do everything required to run the game.
Because of this real time requirement, a decent amount of profiling is
typically done on each game. I typically see that the frame time is getting
split up over a whole array of different sections of the game, i.e.:

\- Calculating skeleton animations (updating bone positions, sometimes
skinning vertices too)

\- Clipping geometry in the scene (finding out what things are inside/outside
the camera frustum, etc.)

\- Processing game logic; things like AI can be quite costly, though much is
game-dependent

\- Walking through all the geometry that needs drawing and issuing draw calls

\- Decompressing streaming audio and sending it to a sound driver buffer/queue

\- Stepping the physics world (integrating positions/rotations and resolving
intersections, etc.)

The difference between a non-optimized and an optimized build is often the
difference between 5 FPS and 60 FPS, and optimizing a single hot file or function would not get the
game running anywhere near 60 FPS. I think the idea that optimizing compilers
aren't required is completely laughable, but then again I only have one
perspective from the games development scene - maybe someone else will reply
and say they make AAA games in C/C++ and don't need compiler optimizations :)

~~~
mwkaufma
AAA dev here. The performance gains mostly come from designing data, not code.
Factoring an array of structures into a structure of arrays and accessing
linearly, e.g., makes better use of the processor cache and saturates
throughput. Code optimization mostly gives us the confidence that there aren't
unnecessary accesses or branches interleaved that might bust the cache.
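
To make the idea concrete, a minimal sketch (illustrative names, not from any
particular engine) of the array-of-structures to structure-of-arrays refactor:

    
      /* Array of structures: a pass over positions drags velocities, health,
         etc. through the cache as well. */
      struct EntityAoS {
          float px, py, pz;
          float vx, vy, vz;
          int   health;
      };
    
      /* Structure of arrays: a linear pass over positions touches only the
         bytes it needs, so each cache line is fully used. */
      struct EntitiesSoA {
          float *px, *py, *pz;
          float *vx, *vy, *vz;
          int   *health;
      };
    
      void integrate(struct EntitiesSoA *e, int n, float dt)
      {
          for (int i = 0; i < n; i++) {
              e->px[i] += e->vx[i] * dt;
              e->py[i] += e->vy[i] * dt;
              e->pz[i] += e->vz[i] * dt;
          }
      }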

~~~
saosebastiao
> Factoring an array of structures into a structure of arrays and accessing
> linearly, e.g., makes better use of the processor cache and saturates
> throughput.

This is an example of a provably correct optimization, and it's a shame that
more compilers do not optimize this situation. Wouldn't it be awesome to model
the data the way that makes sense for your domain and let the compiler worry
about cache locality?

~~~
zurn
Jonathan Blow's Jai language lets you do this by annotating structs, along
with some other assisted memory optimizations (pointer compression is
another).

~~~
AstralStorm
I suspect GCC could do it too for invisible or fully inlined symbols.

This is possible in C99 and C++11 because type punning is forbidden in
general. But that would need a special flag to specify incompatible internal
ABI.

------
lukego
I often think about Proebsting's Law: Compiler Advances Double Computing Power
Every 18 _Years_. Sure, optimizing compilers are nice to have, but maybe their
complexity is disproportionate to their benefit?

I love the way Dynamo [1] is able to reproduce many of the benefits with a
fraction of the complexity by doing some of the optimizations at runtime with
simpler algorithms. Can we use this approach to "garbage collect" some of the
complexity embodied in humongous projects like LLVM?

[1] Dynamo:
[https://people.cs.umass.edu/~emery/classes/cmpsci691s-fall20...](https://people.cs.umass.edu/~emery/classes/cmpsci691s-fall2004/papers/bala00dynamo.pdf)

~~~
Joky
You're mentioning "disproportionate complexity" for current compilers and at
the same time advocating for a runtime system modifying and caching the code
on the fly? Ouch...

Note also that I don't believe that Dynamo "reproduce[s] many of the benefits
[...]", since the paper you linked benchmarked running Dynamo _on top_ of the
O2-generated code (from a ~20-year-old compiler). It isn't clear whether it
just picked the low-hanging fruit that static compilers catch nowadays.
Running the same thing on top of a PGO+LTO build would be a fairer comparison.

~~~
lukego
The exciting possibility, from my perspective, is that many optimizations may
be simpler to implement dynamically (JIT) than statically. So perhaps you can
have 10% of the compiler code to get 90% of the benefit. I see this as the
basic premise of LuaJIT.

Motivating example: The CPU can predict whether branch instructions will be
taken with uncanny accuracy. This is achieved using simple dynamic heuristics.
I believe it would be much more challenging for the compiler to predict these
branches statically.

~~~
sanxiyn
> I believe it would be much more challenging for the compiler to predict
> these branches statically.

Why do you believe so? "Branch prediction for free" is a compiler literature
classic, and was published in PLDI 1993.

~~~
lukego
Thanks for the reference.

I would be interested in a comparison of modern static vs dynamic predictors.
Do you know of one?

The misprediction rates that I see in practice are much lower than the ~20%
that they are talking about in this paper. For example, gcc compiling sqlite3
misses only ~3.8% of branches on my Haswell processor (see below.)

I attribute this improvement to dynamic predictors using more information like
recent history, return stack context, loop counter value, computed branch
address, etc. This information seems beneficial and also challenging for a
static predictor to use.

    
    
      $ perf stat -e branches,branch-misses gcc sqlite3.c
      12,212,167,816      branches:u
         463,727,878      branch-misses:u  #    3.80% of all branches

~~~
nickpsecurity
"modern static vs dynamic predictors"

That again. People jump right to static _versus_ dynamic. Best to mix static
and dynamic analysis to get the best of both worlds. It was already done in both
whole-program optimization and security to get some great benefits. My classic
default is to feed the common case as test runs into an instrumented version
of the system to get execution traces and measurements. That is combined with
static analysis to provide better transformations or scheduling. Also, this is
a high-level description of a solution that applies to many problems rather
than just compiler components.

------
petters
It should be quite easy to see the value of optimizing compilers. Compile your
program with optimizations turned off. Now make it as fast as your release
build again, while still keeping them off. For much of my code, I think this
would take years.

~~~
nullc
You may be significantly underestimating your ability to get phenomenal gains
from algorithmic improvements. I think DJB would argue that the compiler
optimizations are insignificant in comparison to those improvements for a very
large portion of hot code.

Of course, even if I'm correct you could turn the optimizations back on and be
faster still. One reason why you might not want to do that is the cost of
increased miscompilation.

~~~
DannyBee
" I think DJB would argue that the compiler optimizations are insignificant in
comparison to those improvements for a very large portion of hot code."

Except he has no data to show this, and every piece of data I have (and my
colleagues have) says that he is wrong.

Additionally, the compilers can and do replace algorithms if users let them.
Most users _don't_ want it, even if it makes code faster.

Remember that the vast majority of people want software that works and is
reliable, or has fast development cycles, or a million other things, and
happily choose that over software that is fast. Compilers changing algorithms
on you isn't highly compatible with this, especially when, for example, most
programming languages don't even give you a good way to express the invariants
you are trying to maintain (except good ol' Ada).

However, besides that, surely you can see that in the end the people lose.
It's like arguing that a computer would never win at Go.

If I really, really, really cared about some piece of hot code, I wouldn't pay
an expert to optimize it, I'd throw the cycles of 100k spare CPUs at
optimizing it.

That is the other strange part of his argument to me: he judges the
capabilities of compilers based on compilers he sees that are meant to operate
pretty much single-threaded on single machines, and complains "they will never
beat people". It's like looking at a machine learning algorithm that takes
months or years or roughly forever to train on a single CPU and saying "it'll
never do a good job". The world is not that small anymore.

It's blindingly obvious the optimizing compilers win if we want them to.

In any case, this particular debate will restart again as gpus and
accelerators follow the same old compiler cycle.

~~~
oconnor0
> Additionally, the compilers can and do replace algorithms if users let them.
> Most users don't want it, even if it makes code faster.

What compilers do this? With what algorithms?

~~~
jerven
If you look at modern JIT compilers such as Truffle-Ruby, then yes it will
change algorithms, e.g. changing sort algorithms depending on data size: if
the data is small enough it will do a bubble sort in registers, and when it's
big it will do Timsort on heap data, etc. [1]

[1]: [http://chrisseaton.com/rubytruffle/small-data-
structures/](http://chrisseaton.com/rubytruffle/small-data-structures/)

~~~
coldtea
This example is about the compiler changing between (existing) internal
versions of sorting algorithms for language data structures.

Not about changing what algorithms the user coded.

~~~
danielbarla
Exactly, that's two completely different things. It's akin to a compiler using
the famous "switch statement becomes if-else chain if less than 7 items,
otherwise lookup table".
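
To illustrate the kind of lowering that refers to (a generic sketch, not tied
to any particular compiler):

    
      /* The compiler picks the lowering: a few sparse cases typically become a
         compare/branch chain, many dense cases a jump table or lookup table.
         The source stays the same either way. (Leap years ignored.) */
      int days_in_month(int month)
      {
          switch (month) {
          case 2:                          return 28;
          case 4: case 6: case 9: case 11: return 30;
          default:                         return 31;
          }
      }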

The parent posts are referring to macro-level optimisations, where large
changes to how a system works are implemented, in order to get massive gains
in performance. Picking out the essence of the larger implementation would be
a difficult task for a compiler, but more importantly it wouldn't have enough
information about the context of the application to make optimisation
decisions. Things like: do we need to read from that file each time we need
this piece of information, or is it a once-off that can be cached? Unless this
contextual information is expressed somehow (and it basically never is), only
the programmer will be able to make the change.

~~~
jerven

        def clamp(num, min, max)
          [min, num, max].sort[1]
        end
    

being turned into

    
    
        def clamp(num, min, max)
          if num < min
            if min < max
              return min
            elsif num < max
              return max
            else
              return num
            end
          else
            if num < max
              return num
            elsif min < max
              return max
            else
              return min
            end
          end
        end
    

Seems to me to be algorithm rewriting. As is superword optimization and
partial evaluation. All things that Truffle-Ruby + Graal do today.

~~~
danielbarla
Sure, and at this level, I would also argue that compilers are as good as
humans at optimisation (likely better than almost all, actually). These
arguments were constantly going on in the 80s and 90s, where (assembly-
enthusiast) people vehemently argued that machines (C compilers) will never
optimise things as well as humans. While the argument still exists in some
form today, it's certainly died down a lot.

That said, in my own limited experience, everyday optimisation problems tend
to exist at a much higher level, on the high-level approach or "what are we
doing" level. Perhaps these are "obvious" to some, or below the level of
discussion here. But essentially, a compiler can optimise away endlessly at a
piece of code, but will never beat code that shouldn't exist in the first
place. My comment above was that a compiler has insufficient information to
make decisions about what's truly wasteful or useless.

As an example, no compiler today will come up with a Courgette update [1] by
itself. And the day it does, I think we can pack up our bags and go home
(hopefully in a nice, comfy retirement kind of way).

[1] [https://www.chromium.org/developers/design-
documents/softwar...](https://www.chromium.org/developers/design-
documents/software-updates-courgette)

~~~
wolfgke
> These arguments were constantly going on in the 80s and 90s, where
> (assembly-enthusiast) people vehemently argued that machines (C compilers)
> will never optimise things as well as humans.

I still argue this way. But these kinds of optimizations are tedious and
time-intensive (thus costly). They are also architecture-dependent: when some
extensions (say, new SIMD instructions) are added to the instruction set, the
compiler is sometimes able to use them automatically (though typically in a
very sub-optimal way), while hand-optimized assembly code has to be rewritten
(costly). Also, if you port your program to a new CPU architecture (say Intel
-> ARM or ARM-A32 -> ARM-A64), hardly anything can be reused. Finally, tight
hand-optimized code tends to be much harder to add features to than more
high-level code (say, C code).

So I believe these vehemently arguing people are right (and have always been).
But this does not contradict the fact that in many cases the very suboptimal
code generated by the (say, C) compiler is fast enough, and C is much more
economical to use.

------
fizixer
Great talk by DJB.

IMO he couldn't give a convincing answer to the guy who asked whether the
LuaJIT author would be out of a job. But there's a clear answer: JIT authors
are not out of a job, not because optimizing compilers aren't dead, but
because they're writing compilers whose distinguishing ability is producing
"pre-compiled" code.

You might say, "well a JIT author sped up your code's execution so he/she is
writing an optimizing compiler". Well you have to realize that, traditionally,
JIT authors don't just translate the code into object code, they also apply
these things called "compiler optimizations". The point is that if they didn't
do that, and simply produced a faithful translation of the code, they would
still make the code faster because of pre-compilation (and if they enabled the
"compiler optimizations", the code wouldn't run significantly faster than the
simply pre-compiled code).

Regardless of whether I agree with it or not, "Optimizing compilers are dead"
is not the same as saying "JIT authors will be out of business". (Even
compiler writers won't be out of business).

~~~
samth
I was that guy in the audience.

Your suggestion is that a templating JIT that just drops in some machine code
that matches the method, doing no optimization, would get all the win. Such a
compiler is indeed much faster than an interpreter, but it's nowhere close to
an optimizing compiler. Mike Pall, the author of LuaJIT, would be very
surprised if you suggested that his compiler performed similarly to something
simple like that.

~~~
mikemike
Actually, LuaJIT 1.x is just that: a translator from a register-based bytecode
to machine code using templates (small assembler snippets) with fixed register
assignment. There's only a little bit more magic to that, like template
variants depending on the inferred type etc.

You can compare the performance of LuaJIT 1.x and 2.0 yourself on the
benchmark page (for x86). The LuaJIT 1.x JIT-compiled code is only slightly
faster than the heavily tuned LuaJIT 2.x VM plus the 2.x interpreter written
in assembly language by hand. Sometimes the 2.x interpreter even beats the 1.x
compiler.

A lot of this is due to the better design of the 2.x VM (object layout, stack
layout, calling conventions, builtins etc.). But from the perspective of the
CPU, a heavily optimized interpreter does not look that different from
simplistic, template-generated code. The interpreter dispatch overhead can be
moved to independent dependency-chains by the CPU, if you're doing this right.

Of course, the LuaJIT 2.x JIT compiler handily beats both the 2.x interpreter
and the 1.x compiler.

~~~
nkurz
HN is an astonishing thing!

Article: "We can also refute Bernstein’s argument from first principles: the
kind of people who can effectively hand-optimize code are expensive and not
incredibly plentiful."

Commenter: "IMO he couldn't give a convincing answer to the guy who asked
about LuaJIT author being out of a job."

Guy in audience: "I was that guy in the audience."

LuaJIT author: "Actually, LuaJIT 1.x is just that"

Voice in my head: "Aspen 20, I show you at one thousand eight hundred and
forty-two knots, across the ground."

Meta: Apologies for the abstract response, but I couldn't figure out a better
way to present the parallel. It can be hard to explain artistic allusions
without ruining them. What I mean to say is that this pattern of responses
reminded me in a delightful way of the classic story of the SR-71 ground speed
check:
[http://www.econrates.com/reality/schul.html](http://www.econrates.com/reality/schul.html)

------
CJefferson
Having used some languages with awful compilers, I appreciate that compiler
optimisations let me write cleaner code.

In languages with bad optimisers I have to worry about separating code in a
hot loop out into a function -- the cost of a function call is too high. This
one in particular I find can lead to some horrible code, as functions grow
larger and larger and lots of cutting+pasting happens to avoid function call
costs.

On a smaller note, making sure I cache the values of function calls which
won't change -- when instead I could trust the compiler to know the value
won't change and do the caching itself.
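
As a sketch of that second point (hypothetical example; strlen stands in for
any call whose result won't change):

    
      #include <string.h>
    
      /* Clearer form: calls strlen on every iteration unless the compiler can
         prove the length never changes and hoists the call out of the loop. */
      void upcase_clear(char *s)
      {
          for (size_t i = 0; i < strlen(s); i++)
              if (s[i] >= 'a' && s[i] <= 'z')
                  s[i] -= 'a' - 'A';
      }
    
      /* Hand-cached form you end up writing when you can't trust the
         optimiser to do the caching itself. */
      void upcase_cached(char *s)
      {
          size_t n = strlen(s);
          for (size_t i = 0; i < n; i++)
              if (s[i] >= 'a' && s[i] <= 'z')
                  s[i] -= 'a' - 'A';
      }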

~~~
gruez
Isn't there some forceinline attribute in gcc?

~~~
JoachimSchipper
Your parent is probably not talking about gcc, or even about C; gcc does
indeed have __attribute__((always_inline)).

------
lmm
> If an optimizing compiler can speed up code by, for example, 50%, then
> suddenly we need to optimize a lot less code by hand.

This doesn't follow at all. If you have one hot loop and a bunch of cold code,
and auto-optimization makes your code a measly factor of 2 faster, you're
still going to need to hand-optimize the hot loop, and what it does to the
cold code is irrelevant.

> hand-optimized code has higher ongoing maintenance costs than does portable
> source code; we’d like to avoid it when there’s a better way to meet our
> performance goals.

True, but again, only applies if you can optimize by enough to make hand-
optimization unnecessary.

> we’d also have to throw away many of those 16 GB phones that are cheap and
> plentiful and fairly useful today.

This part is nonsense. No-one's got anything like 16GB of _code_ on their
phone.

Optimization could be valuable but current compilers are too opaque, making
optimization too much of a black art. I believe we need to do something along
the lines of "turning the database inside out" (
[https://www.confluent.io/blog/turning-the-database-inside-
ou...](https://www.confluent.io/blog/turning-the-database-inside-out-with-
apache-samza/) ); we should turn the compiler inside out, build it as more of
a library, give the developer more insight into what's going on, have a high
level language that lets you understand how it compiles. Interesting and
vaguely along the same lines: [https://www.microsoft.com/en-
us/research/publication/coq-wor...](https://www.microsoft.com/en-
us/research/publication/coq-worlds-best-macro-
assembler/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-
us%2Fum%2Fpeople%2Fnick%2Fcoqasm.pdf) .

~~~
tom_mellior
> auto-optimize your code to be a measly factor of 2 faster, you're still
> going to need to hand-optimize the hot loop

Huh? You start with performance goals. If compiler-optimized code meets your
performance goals, you are done. You do not "need to hand-optimize" in that
case.

Why would a "factor of 2" _not_ be good enough? What is it compared to? What
makes you so sure that you _must_ optimize further, disregarding any possible
context?

~~~
w0utert
Additionally, if a factor 2 improvement by the compiler on top of a factor 4
improvement by using better algorithms and data structures can give you an 8x
improvement overall, why would you not take it?

Reading through the various comments asserting that optimizing compilers
should not be necessary if you 'just use better algorithms' and whatnot, I'm
kind of wondering why it has to be one thing or the other. Who wouldn't want
to have both?

~~~
lmm
Because using a better algorithm is more like a factor of 100 or 1000. Factors
of 2 are just not worth bothering with; if the business is so marginal that a
factor of 2 in the hardware requirements makes a noticeable difference then
you don't want to be in that business in the first place.

~~~
w0utert
That makes no sense at all. Only if you presume you are always going to start
with some horrible piece of code that only uses the worst possible algorithms
will you get that kind of speedup. And after you've got your 100x to 1000x
speedup, or if you start with a piece of code that already is quite efficient,
it could still be 2x too slow to meet some hard realtime performance
constraint. Or it could be fast enough, but you could potentially save 2x
computing power that can be used for other tasks, just by letting the compiler
optimize your code.

If I tell my employer I'm not interested in getting a 2x speedup by flipping a
compiler switch, but prefer to spend an uncertain time looking for or
inventing this hypothetical algorithm that will give me 100x to 1000x speedup,
I don't think they'll be very happy about that. For most of the code we
deliver to our customers, a factor 'so marginal as 2x' could _literally_ save
them millions of dollars over the lifetime of the software. Two times faster
saves a lot of computing time if you have jobs that run for 72+ hours. If only
things were as easy as you pretend they are (we already _have_ the 'marginal'
2x from the compiler optimizations, and we already _have_ efficient
algorithms)...

~~~
lmm
> Only if you presume you are always going to start with some horrible piece
> of code that only uses the worst possible algorithms you may get that kind
> of speedup

If we're talking about algorithmic improvements we're usually talking about
going from O(n^2) to O(n log n) or similar, which is easily a 100x-1000x or
more speedup if your n is big enough to be bothering with performance at all.
It's very easy to be accidentally quadratic and this causes a lot of
performance bugs. Talking as though algorithmic speedups and optimization
speedups are of a similar magnitude is really misleading.

> it could still be 2x too slow to meet some hard realtime performance
> constraint

Theoretically it could be, but that would take an incredible amount of luck.
There's so much variation in runtime that the odds of getting within a factor
of 2x are tiny; more likely you're thousands of times too slow, or thousands
of times faster than you need to be.

> Two times faster saves a lot of computing time if you have jobs that run for
> 72+ hours.

Sure, if you've got jobs that run for 72 hours in your code (i.e. aren't
spending those 72 hours doing something that's already highly optimized in a
standard library e.g. linear algebra) and have already carefully optimized
your choice of algorithms, then the rules are different. That's a pretty rare
case though.

------
zurn
TL;DR "Optimizing compilers are still good to have because they are cheaper
than programmer labour needed for hand optimization"

The original DJB presentation, which this is a response to, is very good and
interesting.

It would really be nice if the field of compiler engineering started to
address the obvious neglected areas, like optimizing memory layout and data
types/representations based on partial evaluation / profile feedback.

~~~
DannyBee
" like optimizing memory layout and data types/representations based on
partial evaluation / profile feedback."

They already can. The issue is usually one of what is allowable within
language semantics, not of compiler optimization technology.

One reason you may see this as neglected is that outside of, say, polyhedral
loop optimizations, research in the 80's and 90's (and sometimes earlier) did
a _really really_ good job of exploring this area because fortran allowed so
much freedom.

~~~
moomin
Indeed. C# optimises struct memory layout unless told not to, C and C++ don't
and can't.

~~~
kccqzy
I know C and C++ can't because the language standard forbids it. Does anyone
know the reason why? It would be hard but doable. That's, of course, assuming
everyone properly uses macros like offsetof instead of (shudder) hand-
calculated offsets, does not use type punning to access the first members of
structs, etc. The compiler will certainly need to store the memory layout to
enable separate compilation of translation units so that's a complication. Any
other issues I haven't thought of?

~~~
nhaehnle
Actually, the C (and I believe C++) language standards both allow reorganizing
struct layouts, except for the first member which has to be in first place
(for tagged unions).

The struct layout is fixed by platform-specific ABI documents, which must
specify the layout precisely so that structs can be passed between separately
compiled code, e.g. between applications and dynamically linked libraries.

Changing ABIs is virtually impossible, so...

It would be nice to see opt-in optimizations of struct layouts, e.g. by
annotating structs with a packing-style attribute. Though the best gains would
probably be obtained from profile-guided optimizations for cache line
optimizations, and those can only really be done with link-time optimizations
for structs that are _not_ passed outside the linker's scope.
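
As a small illustration of why layout control matters (generic example; sizes
assume a typical LP64 ABI):

    
      #include <stdio.h>
    
      /* char/double/char needs 7 bytes of padding after 'a' and 7 after 'c'
         to keep the double 8-byte aligned: 24 bytes total. */
      struct Padded    { char a; double b; char c; };
    
      /* Same members reordered by hand (the compiler won't do it for you):
         16 bytes total. */
      struct Reordered { double b; char a; char c; };
    
      int main(void)
      {
          printf("%zu %zu\n", sizeof(struct Padded), sizeof(struct Reordered));
          return 0;
      }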

~~~
fanf2
C99 TC3 §6.7.2.1p13 "structure and union specifiers":

"Within a structure object, the non-bit-field members and the units in which
bit-fields reside have addresses that increase in the order in which they are
declared."

~~~
zurn
The as-if rule allows compilers to perform this optimization, provided they
elide the optimization for those structs that have addresses of members
taken/used, though. This requires whole program analysis of course (as in
clang / gcc "LTO" compilation mode). And of course there could be language
extensions to loosen this rule further on a per-struct basis.

------
fovc
In the linked slides, DJB talks about a language for communication with the
compiler, separating optimizations from specification. This reminded me of
VPRI's "Meaning separated from optimization" [1] principle. Does anyone know
what became of that line of thinking? Is this idea making its way into Ohm? I
remember reading a post/paper about optimizing Nile/Gezira to better exploit
the CPU cache (and the struggle to use SIMD), but can't seem to find it now.

[1]
[http://www.vpri.org/pdf/rn2006002_nsfprop.pdf](http://www.vpri.org/pdf/rn2006002_nsfprop.pdf)

~~~
nickpsecurity
IBM's old PL/S language allowed you to give hints like where data would go or
what checks would happen right in the function declaration. The compiler would
handle it from there.

------
jerrre
> Compiler optimization reduces code size

Nope, much is gained by unrolling loops, inlining functions etc, which all
increase code size.

Of course C++ compilation with no optimization at all can be rather wasteful
with performance and code size, but to squeeze the final performance out you
probably need to sacrifice code size (whether manual or automatic)

~~~
aidenn0
> Nope, much is gained by unrolling loops, inlining functions etc, which all
> increase code size.

That's because those are performance optimizations, not size optimizations
(though as an aside, inlining functions _can_ reduce code size in the event
that the inlined version is smaller than the function-call overhead, or in the
case where the function is used only once).

There are plenty of size optimizations that can be performed. -Os will enable
them on gcc/clang if you want to try for yourself.

------
nullc
I'm disappointed at the lack of figures.

------
mrkgnao
I'm posting this as a top-level comment, but it's really a reply to the
discussion downthread about compilers being able to work magic if we let them.
Better still, why not help them?

Something I took for granted for the longest time about Haskell (which remains
the only language I know of with the feature) is the ability to write user-
defined "rewrite rules". You can say, "okay, GHC, I know for a fact that if I
use these functions in such-and-such way, you can replace it by _this_
instead".

    
    
      {-# RULES
        "foo/bar" forall x.
        foo x (bar x) = superOptimizedFooBar x
        #-}
    

A rule like this would be based on the programmer's knowledge of FooBar
theory, which tells her that such an equality holds. The compiler hasn't
studied lax monoidal FooBaroids and cannot be expected to infer this on its
own. :)

Now, anywhere a user of this code writes something like

    
    
      foo [1,2,3] (bar [1,2,3])
    

the compiler will substitute

    
    
      superOptimizedFooBar [1,2,3]
    

in its place. This is a nice way to bring the compiler "closer" to the
programmer, and allow the library author to integrate domain-specific
knowledge into the compiler's optimizations.

You can also "specialize" by using faster implementations in certain cases.
For example,

    
    
      import Data.Bits (shiftL)
    
      timesFour :: Num a => a -> a
      timesFour a = a + a + a + a
    
      timesFourInt :: Int -> Int
      timesFourInt x = shiftL x 2
    
      {-# RULES
        "timesFour/Int" timesFour = timesFourInt
        #-}
    

If you call timesFour on a Double, it will use addition (ha!), but using it on
an Int uses bit shifting instead, because this rule fires.

High-performance Haskell libraries like vector, bytestring, text, pipes, or
conduit _capitalize_ on this feature, among other techniques. When compiling
code written using libraries like this, this is how it goes:

\- rule #1 fires somewhere

\- it rewrites the code into something that matches rule #2, "clearing the
way" for it to fire

\- rule #2 fires

\- rule #3 fires

\- rule #1 fires again

\- rule #4 fires

and so on, triggering a "cascade" of optimizations.

The promise of Haskell is that we already have a "sufficiently smart
compiler": _today_ , with good libraries, GHC is capable of turning clear,
high-level, reusable functional code with chains of function compositions and
folds and so on into tight, fast loops.

\--

I must add, though, that getting rewrite rules to fire in cascades to get "mad
gainz" requires one to grok how the GHC inliner/specializer works.

[http://mpickering.github.io/posts/2017-03-20-inlining-and-
sp...](http://mpickering.github.io/posts/2017-03-20-inlining-and-
specialisation.html)

Data.Vector also utilizes an internal representation that makes fusion
explicit and hence predictable (inevitable, even) called a "bundle":

[https://www.stackage.org/haddock/lts-8.16/vector-0.11.0.0/Da...](https://www.stackage.org/haddock/lts-8.16/vector-0.11.0.0/Data-
Vector-Fusion-Bundle.html)

but this relies on rewrite rules too, e.g. the previous module contains this
rule:

    
    
      {-# RULES
    
      "zipWithM xs xs [Vector.Stream]" forall f xs.
        zipWithM f xs xs = mapM (\x -> f x x) xs   #-}

~~~
openasocket
My biggest concern with something like this is how it affects debuggability
and the principal of least surprise, for a couple of reasons. The biggest is
that in the presence of bugs. If there is a bug in foo but you only use foo in
the context of foo x (bar x) then all those operations get transformed and you
end up with the correct behavior even though your code has a bug that will
suddenly appear when you use foo in a way that that rule isn't applied. Or
there's a bug in superOptimizedFooBar, so you correctly write "foo x (bar x)"
which is correct, but you get the wrong results, and you may waste time trying
to debug foo and bar not realizing the rule replacement. And there is also the
possibility of bugs in the rules themselves. In general, I'm a little hesitant
to use something that is replacing my code behind my back.

It is very interesting, though, and is a good avenue to explore. I've got some
reading to do :)

~~~
mrkgnao
You're right, of course. There are ways around this: for one, without
optimization (i.e. -O0) you don't have any rewrite rules firing, so any
discrepancies in behavior can be tracked down this way.

In practice, most of the libraries that include tons of
rewrite/specialise/inline (ab)use are either "core" libraries (like
vector/bytestring) or have a large userbase (e.g. Conduit), and rules don't
really change too much from version to version, so this has never actually had
the detrimental effects that "silent replacement of code" might have in the
worst case.

This might[0] sound similar to apologetics for dynamically-typed languages:
the only real answer to your question is that rewrite rules are ultimately a
way to improve performance, and they come with caveats usually associated with
replacing clearer algorithms with faster ones. (I'm thinking of the adaptive
sort from the C++ STL, which iirc uses different sorting algorithms for
differently-sized containers to maximize expected performance. It's not
exactly intuitive that vectors of size 100 and vectors of size ~10000 are
sorted differently, is it?)

Of course, the only verifiably-correct way to do these things is through the
use of machine-checkable proofs of the transforms. The well-known
"foldr/build" rule that many optimizations of linked-list functions reduce to,
for instance, has been proven _outside of GHC_ , and there are usually similar
proofs for other rules known to the library author. The appeal of dependently-
typed languages is how they are a nice vehicle for the inclusion of those
proofs in library code: if the compiler can't verify that your rule makes
sense, it can complain instead. You then end up supplying a proof, or saying
"just believe me, okay?" ;)

[0]: "In practice"? Seriously? Next thing I know I'll be parroting "in the
real world" and so on...

------
faragon
Is there any compiler using "machine learning" for SIMD optimization?

~~~
Verdex_2
I'm not sure, but I remember seeing some research into using it for Haskell
stream fusion (can't find the video, sorry).

I believe that the basic idea was that not all rewrites end up being equally
fast (a * b * c can be fused as (a * b) * c, but maybe a * (b * c) is faster). Trying
all the combinations is an option in theory, but you get a combinatorial
explosion, so you normally don't get far in practice. Enter machine learning.
I'm not sure how successful they were, but I imagine that the same sort of
thing could be applied to SIMD.

