
Beating the Compiler - panic
http://www.codersnotes.com/notes/beating-the-compiler/
======
Joky
This seems quite ridiculous to me. I have seldom seen "modern compilers are
always faster than you", but rather "they are good enough that it is not worth
it". The post provides a very over-confident "conclusion" based on a single
dubious test.

The main advantage of compilers is that the optimizations _scale_ across a
large codebase through inlining for example.

Also, just moving from Sandy Bridge to Haswell, for example, can cause
significant performance swings (in both directions). The maintenance cost of
the assembly is again a scaling issue.

If you have a single function that takes a significant amount of time in your
program, and performance is critical, of course you can try to go with lower
level. But it is likely more profitable to start with 1) pre-optimized
libraries (i.e. don't write your own "sort"); 2) the optimization guidelines
of the CPU vendors regarding memory layout, etc.; and 3) C-level vector
intrinsics, if you can benefit from vectorization.

~~~
kayamon
On the contrary, I've often seen the "you can't beat the compiler" statement.
This[1] recent reddit thread has it in the top comment, which is what prompted
me to test it out.

And while all those other points are fine (and I mention all of that in the
conclusion), it doesn't change the fact that beating the compiler isn't
always the rocket science it's made out to be.

[1]
[https://www.reddit.com/r/programming/comments/5f9evm/learnin...](https://www.reddit.com/r/programming/comments/5f9evm/learning_to_read_x86_assembly_language/)

~~~
imauld
Are you the author of the linked post? If so I have a couple of questions:

\- Why not throw out the best and worst cases for each and then find the mean
of run times? Seems like a more "fair" way to compare them.

\- Did you compare the assembly generated by the compiler to the assembly you
wrote?
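For what it's worth, the trimmed mean suggested above is easy to sketch (the helper name is mine, not from the article):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// hypothetical sketch of the suggestion above: drop the single best and
// worst run, then average the remaining timings
double trimmedMean(std::vector<double> runs) {
    std::sort(runs.begin(), runs.end());
    // runs.front() is the best run, runs.back() the worst; skip both
    return std::accumulate(runs.begin() + 1, runs.end() - 1, 0.0)
         / static_cast<double>(runs.size() - 2);
}
```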

~~~
Vendan
You should always be comparing best case for this kind of thing. Slower cases
are most likely "your thread got switched out by the OS to let something else
run", and that's not really a fair test.

~~~
zamalek
Which is why you use 90th percentile.
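A nearest-rank sketch of that idea (hypothetical helper, one of several ways to define a percentile):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// nearest-rank percentile: sort the runs and index p% of the way in,
// so the 90th percentile discards roughly the worst 10% of samples
double percentile(std::vector<double> runs, double p) {
    std::sort(runs.begin(), runs.end());
    std::size_t idx = static_cast<std::size_t>(p / 100.0 * (runs.size() - 1));
    return runs[idx];
}
```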

~~~
to3m
What's wrong with using the best result? If you're concerned the code could
run faster than possible: don't be :)

~~~
bArray
What if the test data is random? You could just get lucky and hit a
happy-path scenario.

~~~
Vendan
Then both versions should get that "happy day"? If you are using distinct
random data for each version, then you aren't really benchmarking properly.

~~~
bArray
It doesn't say that they both use the same data sets.

~~~
Vendan
If they are using different data sets, then I'd say it's an invalid benchmark.

------
bjourne
I ported the recursive variant of the quicksort test and ran it on my
computer. The changes I made were to replace the Windows-specific timing
functions with Linux-specific clock_gettime() calls, and to change the rcx
and rdx registers to rdi and rsi, because those are what the Linux 64-bit
calling convention uses.

Here are my results:

    
    
        sort_asm_recurse.asm               69 ms/loop
        clang++ 3.8.0/sort_cpp_recurse.cpp 65 ms/loop
        g++ 5.4.0/sort_cpp_recurse.cpp     70 ms/loop
    

Compiler flags: -O3 --std=c++11 -fomit-frame-pointer -march=native
-mtune=native
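A minimal sketch of the kind of clock_gettime() timing described above (CLOCK_MONOTONIC assumed; the original Windows timing code is not shown in this thread):

```cpp
#include <time.h>

// Linux replacement for the Windows-specific timing: milliseconds from a
// monotonic clock, immune to wall-clock adjustments
static double nowMs() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

// usage sketch (sortRoutine is the article's routine, assumed name):
//   double t0 = nowMs();
//   sortRoutine(items, count);
//   double elapsed = nowMs() - t0;
```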

So on my computer, the assembly code (barely) beat g++ but not clang++. From
a cursory glance at the assembly clang++ generates, the difference seems to
be that it adds alignment to critical loops.

It is also smarter about using 32-bit registers when it can get away with it.
E.g. the handwritten assembly contains "xor r9, r9"; an equivalent but faster
variant that the compiler generates is "xor r9d, r9d" (writing the 32-bit
register zeroes the upper half anyway).

There is also a slight error in the assembly code: rsp should be aligned to a
16-byte boundary when a call instruction is executed, and the code doesn't
ensure that. It likely loses a fair amount of performance by calling from
unaligned addresses.

~~~
kayamon
It's interesting that your clang and my clang give different results, even
though we're using the same version. I suspect it's a result of differing CPU
architectures (i.e. perhaps my CPU is a different model from yours).

I originally did put loop alignment in my asm version, but I took it out
because it was actually ever so slightly slower on mine. Make of that what you
will.

~~~
bjourne
That's very likely. Mine is an AMD Phenom(tm) II X6 1090T. I changed your
code a little, though, so that the intro looks like this:

    
    
      sortRoutine:
            ; rdi = items
            ; esi = count
            push rbp       ; <- stack alignment push
      sortRoutine_start:
            cmp esi, 2
            jb done
            dec esi
    

The "cmp esi, 2; jb done; dec esi" corresponds to your "sub rdx, 1; jbe done".
That improves it on my machine to 63 ms/loop. If you are interested I can put
it online somewhere.

~~~
qb45

      push rbp       ; <- stack alignment push
    

This shouldn't be needed. 8 byte alignment is fine for the CPU itself. The
purpose of 16 byte alignment is to facilitate making 16 byte aligned stack
allocations.

~~~
bjourne
RSP needs to be aligned to 16 bytes at CALL sites. See e.g.
[https://blogs.msdn.microsoft.com/oldnewthing/20040114-00/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20040114-00/?p=41053)

~~~
qb45
See the first answer here

[http://stackoverflow.com/questions/612443/why-does-the-mac-a...](http://stackoverflow.com/questions/612443/why-does-the-mac-abi-require-16-byte-stack-alignment-for-x86-32)

I also checked the similar manual from AMD, and it doesn't seem to mention
RSP alignment at all, except that "some calling conventions may require ...".

The CPU doesn't care. It only matters when you call functions which allocate
16B objects on the stack.* _This_ function calls only itself and pushes only
8B words on the stack so it's fine with 8B alignment.

* _Some functions generated by C compilers do and they segfault if you call them with wrong alignment. Ask me how I know._

edit:

OK, so I downloaded this code. Results:

    
    
      as-is:     78111us
      push rbp:  73093us
      sub rsp,8: 72332us
      sub rax,8: 72222us
    

Seems to be a matter of instruction alignment, nothing to do with the stack.

------
dalailambda
While this may seem silly to some people, I definitely appreciate the
sentiment. "The compiler is smarter than you" is thrown around often here, and
on Reddit, and a lot of people consider it "common wisdom", but it's not
really correct.

Writing code is having a dialogue with the compiler; it can do better than
you sometimes, and vice versa. But treating the compiler as a magic box that
always spits out faster code than you could is pretty silly.

~~~
unscaled
I can see where this received wisdom is coming from: a counter-reaction to the
common tendency we had well into the 90s to hand-optimize every procedure
considered to be even remotely on the hot path. It didn't even have to be
inline assembly: it could just be C code sprinkled with registers, Duff's
devices and bit shifts.

That used to work well enough for non-portable code targeting a limited range
of CPUs, but nowadays the gains are too small, the ROI is negative, and these
efforts may actually end up backfiring on you.

I guess we needed to spread the knowledge that "the compiler is smarter than
you" even if it wasn't really accurate, just to stop people from doing crazy
stuff out of pure inertia.

~~~
Annatar
_I can see where this received wisdom is coming from: a counter-reaction to
the common tendency we had well into the 90s to hand-optimize every procedure
considered to be even remotely on the hot path. It didn't even have to be
inline assembly: it could just be C code sprinkled with registers, Duff's
devices and bit shifts._

That's not it at all. The original problem was that compilers generated code
several orders of magnitude larger and slower than what we could write in the
demo scene, and made zero use of the hardware beyond the processor and memory
-- no DMA, for instance. And in the demo scene, if you're not getting the
maximum performance out of the hardware, you might as well be dead -- "demo
or die", as Chaos of Sanity (now Farbrausch) so famously put it.

Compilers didn't really catch up with us: the fastest and best they can do
using hardware beyond just the CPU and RAM is CUDA Fortran (the PGI Fortran
compilers). I know of no compiler taking advantage of DMA or audio hardware,
let alone co-processors like the Copper and the Blitter. Even on systems like
the PS3, the GCC compiler took zero advantage of the RSX chip -- it was just
a generic PowerPC compiler.

Surely a compiler will sometimes beat a human by generating a perfectly or
near perfectly scheduled sequence of instructions for a particular processor,
but a human can write a generic piece of assembler code that will get really
good performance across a range of different chips in a given processor
family, and so still beat a compiler overall.

------
chriswarbo
Compilers are usually at a disadvantage compared to human programmers, as
they're under pressure to produce code as quickly as possible: seconds if
possible, minutes at worst. A human may spend many hours or days writing,
profiling, testing, etc. This biases the kinds of algorithms that compilers
use (especially JITs, which have even stricter time requirements).

It would be nice to have a
compiler/optimiser/analyser/profiler/tester/fuzzer/etc. designed to run for
long periods, running all sorts of improvement-finding algorithms, building up
a knowledge base about the code on disk (which can be updated incrementally
when the code changes), and providing reports and messages to the user.

When we're about to embark on a deep dive, for optimisation/debugging/etc. we
can fire up this assistant and have it running for the entire time we're
devoting to the problem. It can even keep running overnight if we spend
several days on the issue.

~~~
refset
Your description of the assistant reminds me of a Clojure talk I watched
recently[1] where the speaker outlines how a central pool of knowledge about
invariant compilation properties could embody the scientific method.

[1] "Bare Metal Clojure with clojure.Spec"
[https://www.youtube.com/watch?v=yGko70hIEwk](https://www.youtube.com/watch?v=yGko70hIEwk)

~~~
chriswarbo
Haven't watched the talk yet, but it sounds very similar to my own thinking.
For example, I think the "pipeline" approach to compilation (preprocess ->
lex -> parse -> desugar -> inference -> check -> optimise -> code generation)
is very restrictive, as it precludes many other activities (e.g. static
analysis), forcing custom lexers, parsers, etc. to be created in parallel,
which may or may not work with existing code.

I think a more data-based approach would be preferable, for example we might
think of "source text", "preprocessed", "lexed", etc. as being tables in a
relational database, and the phases of the pipeline as views/projections which
populate one table from another. Optimisation would just be a one:many
relation, with a single source text corresponding to multiple possible
expressions. This data could be stored, collated, mined, augmented, etc. by
various other processes, which allows more approaches to be taken than _just_
compiling.

Of course, this is just one idea; and the relational part would only need to
be an _interface_ to the data, it could be computed lazily, and wouldn't
necessarily be _implemented_ with an actual RDBMS storing all of the
intermediate bits.

------
prestonbriggs
It's easy to beat a compiler in the small; it just takes time & patience. But
such an approach doesn't scale. We don't write tiny routines and throw them
away; instead, we write big programs made of lots of routines & classes, and
we maintain them for years, probably porting them from machine to machine.

I encourage everyone to write some assembly; you'll learn a lot. But use a
compiler for your work.

------
pkolaczk
If he sorts 1 million items, I guess he runs out of L1 cache and probably out
of L2 cache as well. Memory accesses may therefore play the biggest role
here, which explains why he sees almost no improvement from recursion
elimination.
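A back-of-envelope check of that guess, assuming 4-byte ints and typical (model-dependent) cache sizes:

```cpp
#include <cstddef>

// one million 32-bit ints is ~3.8 MiB, far beyond a typical 32 KiB L1d
// and a 256 KiB-1 MiB L2 (assumed sizes; they vary by CPU model)
constexpr std::size_t kItems = 1000000;
constexpr std::size_t kBytes = kItems * sizeof(int);  // 4,000,000 bytes
constexpr bool kFitsInL2 = kBytes <= 1024 * 1024;     // spills to L3/DRAM
```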

------
partycoder
Intel and AMD publish programming guides that can help you produce more
optimized code.

Then there are some aspects that compilers might not optimize much for. I
like this guide:
[http://www.farbrausch.com/~fg/seminars/lightspeed_download.p...](http://www.farbrausch.com/~fg/seminars/lightspeed_download.pdf)

It's old, dated, whatever you want, but covers the basics.

edit: it seems that link got the "HN hug of death".

------
illys
About Human vs. Compiler, I see a very different issue: most developers
(especially at big corps) only know objects and do not have a clue about how
a processor actually works.

As a result, most high-level programming has very poor performance, whatever
the compiler quality. This is certainly why we keep waiting seconds for
simple operations.

Questioning compiler output is a very good exercise to become a better
developer, whether you can beat the compiler or not.

------
gabrielcsapo
"I suppose if there's anything to be learned here, it's that people on the
Internet may sometimes be full of shit." is the most undervalued quote.

------
donovanr
Sedgewick's 1978 paper[0] on implementing quicksort has some interesting hand
optimizations of the assembly code -- loop rotating, unrolling, etc. I wonder
if modern compilers do the same?

[0]
[http://penguin.ewu.edu/cscd300/Topic/AdvSorting/Sedgewick.pd...](http://penguin.ewu.edu/cscd300/Topic/AdvSorting/Sedgewick.pdf)

~~~
pertymcpert
Yep, loop rotation and unrolling are done very commonly.

------
sweettea
Best-case seems like a poor metric when the CPU scheduler could easily cause
7% variation. I would be interested to see, say, 100x the number of runs, and
the mean rather than the best, since one usually cares about the average more
than the best case.

I also wish I knew what optimization settings GCC/etc was using, and what
effect tweaking those has.

~~~
Joky
Because of noise in general, "best case" always seems like the best metric to
me. Over a large number of runs, you're likely to hit the "perfect"
measurement on a microbenchmark.

Otherwise, for an "adaptive" number of runs until enough time is spent to
have some "confidence" in the measurement, I've been fairly happy with:
[https://github.com/google/benchmark/](https://github.com/google/benchmark/)

~~~
andrepd
Just show more statistics: mean, variance, min, max, at least.

------
swolchok
It's not mentioned in the article, so I'll note that the code presented is
Windows-specific. Windows uses a different calling convention
([https://en.wikipedia.org/wiki/X86_calling_conventions#Micros...](https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention))
from the one used on Mac and Linux systems
([https://en.wikipedia.org/wiki/X86_calling_conventions#System...](https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI)),
so if you want to see the assembly you get from clang on Mac, you'll want to
annotate sortRoutine with __attribute__((ms_abi)).
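As a sketch, the annotation looks like this (sortRoutine is the article's name, and the signature is assumed from the thread; the attribute works on declarations and definitions alike):

```cpp
// declare the hand-written routine with the Microsoft x64 calling
// convention so clang/gcc on Mac or Linux pass arguments in rcx/rdx,
// matching the article's assembly (assumed name and signature)
extern "C" __attribute__((ms_abi)) void sortRoutine(int* items, int count);

// the attribute also works on definitions; callers transparently use the
// Microsoft convention for this one function
__attribute__((ms_abi)) static int addMs(int a, int b) { return a + b; }
```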

------
DannyBee
Yes, you can pretty easily beat the compiler in simple cases when you do this.

I would seriously challenge anyone to try to do, by hand, what PLUTO+ does:
[http://dl.acm.org/citation.cfm?id=2688512](http://dl.acm.org/citation.cfm?id=2688512)
It is implemented in at least one real production C++ compiler. The analogues
would be Graphite in GCC and Polly in LLVM, but they don't have the full cost
modeling it does. Then try to do it for multiple architectures, or even
different cache models (i.e. newer vs. older processors).

Even simpler things than that, like deciding when it is profitable to add
runtime vectorization/alignment checks, are really hard to do by hand. Hell,
in larger functions, I doubt people can even do register allocation optimally
(including live-range splitting, rematerialization, etc.).

So yeah, stupid quicksort, sure, you can beat it.

I'm not sure what it's supposed to prove?

If you restrict yourself to small cases that are easily optimizable without
any thought, and not amenable to any even slightly advanced optimization,
then yes, you can beat the compiler.

------
wictory
So I guess the lesson to take from this post is that you can beat the
compiler. We should also appreciate that the people who did similar analyses
and did not get a speedup most probably did not write a blog post about it.

------
swolchok
I would like to see in the article a discussion of the assembly the compiler
produces, how it differs from the assembly the author wrote, and perhaps why
the differences are worse.

------
mistercow
>where making good use of the SIMD intrinsics can allow assembly to massively
beat the compiler.

Is this the correct use of this terminology? I thought intrinsics were
functions that allow you to tell the compiler to use particular instructions,
specifically so you can avoid dropping into assembly. In assembly, wouldn't
you just call them "instructions"?
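For context, an intrinsic is a C-level function the compiler lowers to a specific instruction, e.g. this SSE sketch (function name is mine):

```cpp
#include <immintrin.h>

// _mm_add_ps is an intrinsic: a C function the compiler lowers to the
// single ADDPS instruction. In hand-written assembly you would simply
// write "addps xmm0, xmm1" yourself.
void add4(const float* x, const float* y, float* out) {
    __m128 a = _mm_loadu_ps(x);           // unaligned 4-float load
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(a, b)); // lanewise add, then store
}
```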

~~~
kayamon
Yeah you're right, I'll fix that.

------
keithnz
I'm just curious whether there is any overhead in the compiler outputs, as
the author seemed to be timing the .exe.

It would be interesting to see the assembly output of all the compilers, and
what compiler settings were used.

~~~
kayamon
The timing happens directly around the function itself inside the EXE.

Compiler settings are in the makefile, full optimization (-O3 or /Ox)

------
mnarayan01
Should the second assembler statement use `jle done` rather than `jbe done` to
preserve the original semantics? (I know nothing about assembly so could be
missing something obvious.)

~~~
kayamon
Yeah, it probably should. It doesn't affect the performance, though; the
difference would only manifest if you passed in a negative count, which would
be an error anyway.
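To make the difference concrete, here is a C-level model of the two jumps after "sub rdx, 1" (helper names are mine):

```cpp
#include <cstdint>

// models the branch after "sub rdx, 1": jbe is the unsigned test
// (taken iff count <= 1 viewed as unsigned), jle the signed one
bool jbe_taken(int64_t count) { return static_cast<uint64_t>(count) <= 1; }
bool jle_taken(int64_t count) { return count <= 1; }

// for a negative count, e.g. -3: jle bails out early, but jbe falls
// through, so the sort would proceed with a huge unsigned count
```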

~~~
mnarayan01
If your version goes UB on a zero length input array, then I think the
compiler not only wins, but wins by _a lot_.

Obviously you can easily fix it, but the statement people generally make is "
_You_ can't beat the compiler" not "(You, me, whomever else [that was as far
as I got], and/or a huge time investment) can't beat the compiler". All that
said...people categorically saying "You can't beat the compiler" annoys me too
(though in my case they're right; I can't).

~~~
kayamon
It only fails on a _negative_ count. Zero count works. If your program is
passing around negative counts, it's already broken, and the exact specifics
of where and when the brokenness manifests aren't particularly important.

Technically I should have used a size_t instead of an int for the count
anyway, so it's kinda a moot point. I just picked int to make a simpler toy
example program.

------
leitasat
Isn't a tail-recursive version of the quicksort algorithm needed to really
allow the compiler to optimize performance?

~~~
kayamon
All of the compilers I tried automatically detected it and did their own tail-
recursion.
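The transformation the compilers apply looks roughly like this hand-rolled sketch (not the article's exact code): recurse on one partition, loop on the other.

```cpp
#include <cstddef>
#include <utility>

// quicksort with the tail call turned into a loop, roughly what the
// compilers do automatically; recursing on the smaller half also bounds
// stack depth to O(log n)
void quicksortLoop(int* a, std::size_t n) {
    while (n > 1) {
        // simple Lomuto partition around the last element (illustrative)
        int pivot = a[n - 1];
        std::size_t i = 0;
        for (std::size_t j = 0; j + 1 < n; ++j)
            if (a[j] < pivot) std::swap(a[i++], a[j]);
        std::swap(a[i], a[n - 1]);

        std::size_t left = i, right = n - i - 1;
        if (left < right) {   // recurse on the smaller side, loop on the larger
            quicksortLoop(a, left);
            a += i + 1;
            n = right;
        } else {
            quicksortLoop(a + i + 1, right);
            n = left;
        }
    }
}
```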

------
olzhas
why the best-case was chosen instead of mean or median?

~~~
emeryberger
You should _never_ do this. Best-case favors outliers and does not represent
expected performance, which is what we care about. Just because the stars
happen to align one time doesn't mean you should report that run.

Consider the following runs of two systems:

system A: 10s, 10s, 10s, 10s, 10s, 10s, 10s, 5s

system B: 6s, 6s, 6s, 6s, 6s, 6s, 6s, 6s

Which one is faster? (Hint: don't say system A)
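Running the numbers (hypothetical helper): system A's best case (5s) beats B's (6s), but its mean is much worse.

```cpp
#include <numeric>
#include <vector>

// mean of a run set; for the samples above, mean(A) = 75/8 = 9.375s
// while mean(B) = 6s, so B is the faster system despite A's lucky run
double mean(const std::vector<double>& runs) {
    return std::accumulate(runs.begin(), runs.end(), 0.0)
         / static_cast<double>(runs.size());
}
```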

~~~
alephnil
In practice you will hardly ever see outliers like the one you described in
system A, where one run is significantly faster. You will often see cases
where one run is significantly slower. The causes could be things like cache
misses, swapped-out code, a bad code path being hit, etc. (all on very
different timescales). These things tend to happen only occasionally, so the
reverse of your example A (seven five-second runs and one ten-second run) is
more plausible. Because such factors tend to be things you can't easily
control, taking the minimum is a good approximation when optimizing a code
snippet, as opposed to the whole program.

------
titzer
Nice job. Here's 900,000 lines of C++ code for you to now translate to
assembly. And after you're done with that, I'd like to change a few lines and
have you do it over again, preferably 100 times a day.

/sigh /compiler person

~~~
titzer
It'd be nice if people replied instead of just downvoting.

You're missing the point of a compiler. It does a huge amount of work to
reliably get a very, very good solution to a huge problem in a reasonable
amount of time. Depending on the optimization settings, of course it is not
going to try its hardest to get the very best code out of every single
function. Besides, you can always use the output of the compiler as your
starting point for hand optimization.

Why don't you try your hand at some Fortran kernels where a compiler might
spend a few minutes or hours optimizing the hell out of something extremely
important? I doubt you'll beat a Fortran compiler at its main job.

No one is claiming that you can't beat the compiler some of the time. But you
can't beat the compiler even 0.01% of the time, given how much code there is
out there.

~~~
kayamon
I can't beat the compiler even 0.01% of the time? Really? Because that's the
point of the article -- I just grabbed the first piece of C I found and
managed to beat the compiler.

