
You can’t beat a good compiler…. right? - mnemonik
http://lbrandy.com/blog/2010/06/you-cant-beat-a-good-compile/
======
ori_b
You can easily beat a good compiler for many kinds of code. It just takes
time, effort and a good deal of knowledge. If you're doing it outside of the
hottest kernels of your code, you're probably wasting your time.

~~~
kenjackson
And it also requires that you pay attention to your evolving toolset. I've
worked on x86 code that was hand-optimized back in 1995. Once I reverse-
engineered it and handed it to Visual C++ the code gen'ned by VC killed it, by
like a factor of 2x.

I emailed the code to the guy who originally wrote it and his response was
brief: "The compiler wasn't that smart before". And yet the code sat there for
like 10 years, untouched -- largely because people don't like to touch x86
assembly code that is working.

The moral: you may get a benefit today, but you don't get any future benefits
from the investment at MS, Intel, the GCC folks, etc., who are improving the
compilers. Sometimes it is the right call to hand-optimize. But when you do, I
suggest keeping a C/C++ version alongside it that you can test against
whenever the compiler revs.

~~~
zbyszek
An anecdote in a similar vein:

I was involved in writing (Fortran) code for Crays in the 90s. For the T3D, we
speeded up the code with the usual set of tricks -- loop unrolling, array
padding and so on. With the later T3E, it turned out that not only could the
compiler now do all that, but that code hand-optimised in this way performed
worse.
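For readers who never did this by hand, here is a minimal sketch of the kind of manual unrolling being described (function names are mine, purely illustrative). The hand-unrolled version was the classic trick: fewer loop-exit tests and more independent accumulators per iteration. As the T3E anecdote shows, modern compilers do this themselves, and the hand-rolled form can end up slower:

```c
#include <stddef.h>

/* Straightforward sum: the compiler-friendly baseline. */
float sum_simple(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by four, 90s-style: four independent partial sums
   reduce loop overhead and expose instruction-level parallelism. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Note that the unrolled version also changes the order of floating-point additions, which a conforming compiler is not allowed to do on its own -- one reason such rewrites were done by hand in the first place.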

~~~
kenjackson
Heh, we may know each other. I also developed code on the T3D and T3E. I
figure there can't be too many people in the world who developed for both. I
still have the big gray Cray binders with the documentation lying around
somewhere.

------
wglb
This is really a matter of scale and reality. Let's think in survival terms.
Some applications, oh, say, high-frequency trading, depend on getting the
calculation done first. We are talking milliseconds or less to do a
calculation that leads to placing, cancelling, or modifying an order. Do they
use assembler? No. Problems of scale requiring massive hardware or clusters,
do they use assembler? Not generally. Often they use Lisp, C++ or C.

So in the context of real-world problems, compilers win, because none of us
has the energy or attention span to write something the size of a real-world
problem, say a compiler, in assembler.

We optimize small problems by hand, but problems of useful size we don't
because the compiler does a better job on problems of true interest than we
have the patience or time to do.

To put it another way, it is better to put energy into improving compilers
than getting into a John Henry type of competition.

~~~
zbyszek
Your point is well made, but if the large, relevant problem can be broken into
small sub-problems, and if one of those consumes a significant amount of
compute time, then that might (just might!) be worth optimising in assembler.

My experience of problems requiring massive hardware or clusters (limited, I
admit, to certain classes of scientific computing) is that such assembler
coding of speed-critical components is actually often done.

------
InclinedPlane
John Henry won the race against the steam-hammer too, before he died. The
point is not that it's impossible to beat a good compiler at micro-
optimization, the point is that it's rarely worth the effort. More important
than being able to beat the compiler at optimizing any given piece of code is
building the tooling (and having the discipline and experience) to know when
and where to best leverage your development efforts in optimization.

Rather than trying to carpet-bomb everywhere with micro-optimization hand-
grenades, use laser guided precision to hit the targets that count. So instead
of wasting your time manually unrolling loops in code that's just going to be
refactored away before release anyway, find out exactly where your code is
bogging down and concentrate on fixing that.

------
maweaver
Wouldn't something like

    
    
        float temp_out = 0.0f;
        for (j = 0; j < 5; j++)
            temp_out += filter[j] * in[i+j];
        out[i] = temp_out;
    

solve the performance issue, and allow for out = in? (which I would think
would be desirable, in cases where you don't need the original input).

Also, this is way outside my area of expertise, but when I hear convolution I
think FFT. I would be curious if anyone knows whether there is an algorithmic
improvement to be made here.

~~~
cameldrv
I think that your solution is better in that it's simpler, but I'm not sure
what the compiler will do in this case. Theoretically, in[i+j] could point to
temp_out, but it could only be through very tricky means, since temp_out is on
the stack, and the caller presumably doesn't know where it will be allocated.

An FFT wouldn't be faster in this case since the filter only has five taps.
Generally there has to be a decent sized kernel for the FFT to be worth it --
the FFT is O(N log N), and this algorithm is only 5*N operations.

~~~
JoachimSchipper
The compiler is free to optimize in this case, because there's no
_requirement_ that in[i + j] point to temp_out, even if you manage to make the
stars align.

------
JoeAltmaier
Better examples of good compiler optimization would be something other than
math loops. Math is kind of shooting fish in a barrel. Where the compiler
shines is when it optimizes through necessary abstraction. Sure, you can hand-
code a flat struct, but you'd rather have objects with ctors etc. (insert
favorite abstraction here).

------
ramchip
Starts by saying how it's a huge myth that compilers generate better code than
hand-made assembly... and ends with compiler-generated code, the author
basically saying "I think I could do better".

It was interesting to see what the compiler generated, though it won't be a
big surprise to experienced C coders.

~~~
lbrandy
> Starts by saying how it's a huge myth that compilers generate better code
> than hand-made assembly...

It's important to note that I never said this.

I was simply trying to separate two versions of the same rule of thumb... I
said that using a compiler, a smart low-level programmer can generate the
assembly code he wants. This is not "hand-made" assembly, but this is not just
the magic of an optimizing compiler, either.

~~~
ramchip
I do see your point and I found it interesting (I've had to optimize filters
in the past and used the same process of coding C and looking at the
assembly), but how can "beating the compiler" be interpreted as anything but
writing better code than the compiler?

~~~
Retric
Writing optimized code generally involves looking at larger sections of the
problem / code base and picking a different approach based on global
conditions.

So, when you and the compiler are both looking at the same code base, but you
find a better local approach based on a global condition, you have beaten the
compiler. Even if all you did was tell it about that condition.
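As a concrete (hypothetical) illustration of "telling the compiler about a global condition": suppose an invariant elsewhere in the code base guarantees a buffer at one call site always holds exactly 8 elements. The generic routine below leaves the compiler with a runtime trip count; the specialized one bakes in the global knowledge, letting it fully unroll and vectorize, which it could not prove safe from the local code alone:

```c
/* Generic version: n is only known at run time, so the optimizer
   must keep the loop test and cannot fully unroll. */
void scale(float *a, int n, float k) {
    for (int i = 0; i < n; i++)
        a[i] *= k;
}

/* Specialized version for a call site where a global invariant
   guarantees exactly 8 elements. The constant bound lets the
   compiler unroll and vectorize the whole loop away. */
void scale8(float *a, float k) {
    for (int i = 0; i < 8; i++)
        a[i] *= k;
}
```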

------
Malic
I think the Euro-demo coders of the 80s made it clear that handwritten
assembly can be optimized to a ridiculous point of efficiency.

~~~
sjf
Very true. But these were specifically intended to test the limits of coding.
In other situations, you have to ask yourself, is this really adding value?

------
jheriko
Good post.

I'm a big fan of looking at the assembler code generated from programs - even
if you are not optimising - it helps give a depth of understanding which is
simply not possible without doing it.

Not to mention that the "smart" compiler is a myth - I've seen some downright
stupid asm come out when the compiler is configured wrong - and it gets
missed, even with profiling-based optimisation targeting the area - because
looking at assembler is actively discouraged.

We are spoiled with IDEs and high level languages...

------
JoachimSchipper
This has some caveats that I pointed out at
[http://lbrandy.com/blog/2010/06/you-cant-beat-a-good-
compile...](http://lbrandy.com/blog/2010/06/you-cant-beat-a-good-
compile/comment-page-1/#comment-26448); still, a nice enough article.

EDIT: note that the off-by-one error is MY mistake - it simply isn't there. My
point about "restrict" is valid, though.
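For readers who haven't met `restrict`, here is a sketch of that point applied to the article's 5-tap filter (C99; the function and parameter names are mine, not the article's). The qualifiers promise the compiler that the three pointers never alias, so it may keep `filter[]` and the running sum in registers instead of reloading memory after every store to `out[]`:

```c
/* 5-tap FIR filter with restrict-qualified pointers (C99).
   The restrict qualifiers assert that in, out and filter do not
   overlap; violating that promise is undefined behavior, which is
   exactly what frees the optimizer to cache loads in registers. */
void fir5(const float *restrict in, float *restrict out,
          const float *restrict filter, int length) {
    for (int i = 0; i < length - 4; i++) {
        float acc = 0.0f;
        for (int j = 0; j < 5; j++)
            acc += filter[j] * in[i + j];
        out[i] = acc;
    }
}
```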

~~~
lbrandy
This is a terrifying post for me to make because I feel like I'm about to make
an ass of myself... but...

I'm not seeing an off-by-one error. 'i' will be LENGTH-5 on the last iteration
and we will add 'j==4' for in[LENGTH-1] not in[LENGTH].

~~~
JoachimSchipper
You are actually right. Sorry! I shouldn't be trying to do N things at the
same time - I somehow misread that as i <= LENGTH - 4.

------
zwieback
I agree that checking assembly output is very good practice and even more
important on not-so-common architectures. x86 and ARM compilers are very good
now but if you're working on a different processor the compiler might not be
so good.

Especially on DSPs and microcontrollers you may end up changing instruction
order to avoid pipeline stalls or use on-chip RAM buffers for your often used
data.

Also, if you start writing a routine in assembly from scratch it's often
surprising how many different ways you can implement an algorithm since you're
not as bound by the calling conventions and runtime assumptions made by the
compiler.

Not that I would recommend that for anything other than the innermost loops.

By the way, are we talking about anything other than C or maybe Fortran here?
I'm guessing Java or C# coders don't spend too much time looking at the IL, I
know I've only done it for educational purposes.

------
SkyMarshal
Speaking of this, can anyone recommend any good resources for learning
assembler, targeted at noobs who know some C but nothing more low-level?

I need to plug some holes in my skills, and this is one of them.

~~~
nitrogen
I wouldn't necessarily recommend this approach, but the way I learned to read
x86 assembly (I've never needed to write it, but I do write other assembly
languages) was by reading Michael Abrash's Graphics Programming Black Book and
disassembling a Windows driver for a piece of hardware in order to create a
Linux driver. Having a practical motivation other than just "learning assembly
language" can be beneficial.

You could also try getting an embedded development kit, like the USB
Bitwhacker, PICKit, or an AVR-based kit, and programming a microcontroller in
assembly language.

------
DCoder
In certain cases, this level of optimization makes perfect sense. In other
cases, arguably more common when writing C++, it makes more sense to invest
that time in understanding the object model, i.e. what actually happens when
you pass a struct by value, return an object (NRV), where invisible object
copying occurs (STL, I'm looking at you) or similar issues. Lippman's "C++
Object Model" is a good reference for that.

