

Will it optimize? - mark_h
http://ridiculousfish.com/blog/archives/2010/07/23/will-it-optimize/

======
stcredzero

        > And lastly, this isn’t meant to be a prescriptive post,
        > but we all know why micro-optimizing is usually a 
        > mistake: it wastes your time, it’s easy to screw up 
        > (see question 5), and it *typically* produces no 
        > measurable speedup.
    

I emphasized the _typically_ because this is a key idea: you are fallible --
_deal with it_!

I visited a local hacker-space with a friend of mine to talk about a project
of ours. A bunch of those guys are from NASA and they know quite a bit about
how to engineer for reliability. Since our project involved a lot of
soldering, he suggested that we estimate our average solder joint failure rate
and use that as the number of expected failures to look for. That's what he
does in his personal projects.

I exclaimed that was brilliant, then was a little embarrassed. Of course! Duh!
I know I'm not perfect and I'm smart enough to have deduced that I have
something like an average failure rate for solder joints. So why have I been
proceeding for years and years as if I could just do everything perfectly if I
just tried hard enough?

Every new function you write is a risk. Minimize your risks. Maximize the
cost/benefit. Every routine you optimize is a risk. Minimize your risks.
Maximize cost/benefit.

One's problems as a programmer are more likely the result of excess hubris
than insufficient smarts.

EDIT: Fear is as big a problem as hubris. Walk the middle path. You are your
own optimization problem, and the optimal solution is almost never a no-
brainer shove of a lever all the way in some direction.

~~~
flogic
That's awesome. I'm always impressed by the reliability work that comes out of
NASA. I don't regret a single penny of my tax dollars which goes to them just
for this sort of thing alone.

------
wooster
ridiculousfish is the smartest programmer I've ever met. I hired him at Apple
years ago, and while I could regale you with tales of how awesome he is, Apple
would probably sue me.

That said, if you've ever played World of Warcraft on the Mac, you should
worship him as a god.

:-)

~~~
zackattack
This dude is pretty much my dream cofounder. Math background, aesthetically
intuitive, intellectually curious, specifically interested in psychology, and
clearly a talented programmer.

[http://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Cor...](http://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Corydoras_semiaquilus_1.jpg/800px-Corydoras_semiaquilus_1.jpg)

~~~
messel
Wooster would never let him go. Better to make woost an irresistible offer,
then go after this fellow.

Seriously beautiful post layout

~~~
pbhjpbhj
>Seriously beautiful post layout

Shame about the blog chrome, though. Seriously, all I had above the fold was
the fish picture, and until I looked at the scrollbar I had nothing to
indicate that the page had any content; I assumed it was some sort of Flash
game.

Also within the post the code overflows the code boxes for me (I'm 2 clicks up
on font size).

Looks to be some problem on the all-posts page too -
<http://ridiculousfish.com/blog/all-posts/> FWIW.

~~~
messel
I'm using the latest chromium daily build and it looked fine.

------
jakevoytko
I avoid these types of optimizations, say, 97% of the time.

This reminded me of Linus' blog post on optimizing SHA1 routines in Git [0].
FTA: "Some people seem to think that C is a real programming language, but
they are sadly mistaken. It really is about writing almost-portable assembly
language, and it turns out that getting good results from SHA1 really is
mostly about trying to fight the compilers tendency to try to be clever."

[0] <http://torvalds-family.blogspot.com/2009/08/programming.html>

~~~
grogers
There is a self-tuning linear algebra library called ATLAS. When building, it
probes various things about your computer: L1/L2 cache sizes, whether you have
fused multiply-add, etc. It uses these to output very highly tuned matrix
kernels in standard C.

Ironically, when compiling this C code, you have to use -O, higher
optimization levels make the compiler try to be too smart for its own good and
actually perform worse.
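
For a rough idea of what those kernels look like, here's a sketch of one tile
of a blocked matrix multiply (illustrative only, not ATLAS's actual output;
NB stands in for the block size it derives from the cache probe):

        /* One NB x NB tile of C += A*B, with row stride n (assumes n >= NB);
           ATLAS tunes NB so the three tiles fit the L1 cache it probed. */
        #include <stddef.h>

        enum { NB = 64 };  /* hypothetical value from the cache probe */

        void gemm_tile(size_t n, const double *A, const double *B, double *C)
        {
            for (size_t i = 0; i < NB; i++)
                for (size_t k = 0; k < NB; k++)
                    for (size_t j = 0; j < NB; j++)
                        C[i*n + j] += A[i*n + k] * B[k*n + j];
        }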

------
grogers
"2. Loop-invariant strlen()

GCC does this optimization, because strlen is a "built-in function:" one that
gcc recognizes and optimizes specially. Disabling built-ins with -fno-builtin
defeats this optimization."

No, it does it because strlen is declared to be a pure function, and it is
smart enough to realize you are not modifying anything in s. -fno-builtin
doesn't defeat the optimization: the code still calls the strlen function (at
least on my gcc), but the call is hoisted out of the loop.
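
To make the pattern concrete, here's a minimal sketch (close to, though not
necessarily identical to, the article's example). With strlen declared pure,
as in glibc's headers, gcc knows the result depends only on memory the loop
never writes, so the call can be evaluated once up front:

        #include <stddef.h>
        #include <string.h>

        size_t sum(const char *s)
        {
            size_t result = 0;
            for (size_t i = 0; i < strlen(s); i++)  /* hoisted: one call */
                result += (unsigned char)s[i];
            return result;
        }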

~~~
astrange
strlen() is not declared pure in the OS X headers.

------
praptak
Gcc surprised me once. I had put some effort into replacing something like (x
<< 12) | (x>>20) with the appropriate roll instruction at the assembly level.
Only after seeing that my "optimized" code ran no faster than plain GCC's did
I look at the GCC output, and of course the roll was there.

I believe it was GCC 3.*, so even older than the one mentioned in the article.
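
As a self-contained sketch, the idiom in question looks like this; gcc
recognizes the shift-or pattern and emits a single rotate instruction:

        #include <stdint.h>

        uint32_t rotl12(uint32_t x)
        {
            return (x << 12) | (x >> 20);  /* compiled to roll on x86 */
        }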

------
smcl
Ha, I really like this guy. I started reading another one of his blog entries
(a description of how compilers often optimise a divide by a constant where,
instead of issuing an expensive divide, it multiplies by a "magic" number and
then right shifts the result). After hitting lots of complex (for me anyway)
mathematical proofs, my brain tuned out slightly until I saw

"Did you skip ahead? It's OK. Here's the summary"

I like to think he deliberately did that for hungover-yet-still-interested
devs like myself.
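
For the curious, here's the trick written out as a sketch for unsigned
division by 5 (the magic constant and shift are divisor-specific; gcc derives
them itself). 0xCCCCCCCD is ceil(2^34 / 5), so the top bits of the 64-bit
product recover the quotient with no divide instruction:

        #include <stdint.h>

        uint32_t div5(uint32_t n)
        {
            return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);
        }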

~~~
sprout
Well, I never read proofs on the first pass. I mean, I read each expression,
but I don't bother trying to assimilate them into a coherent thought until
I've finished any plain-English associated text.

And even that I'll sometimes leave for a second pass to comprehend properly.

------
viraptor
Question number 2 really surprised me. What if the layout of memory is:

    
    
        (s)(s+1)(s+2)...(result)
    

If `s` has no null bytes before `result`, then `strlen(s)` will terminate on
`result`. However, if you change `result` while iterating over `s`, then
`result` will no longer be null and the `strlen()` result will change. Or is
there some rule in the C standard that says local variables are not in the
same memory range as external pointers?

~~~
ramchip
Attempting to read s[i] when i is outside the bounds of the array is undefined
behaviour, so the compiler is free to ignore this case. See example at
[http://en.wikipedia.org/wiki/Undefined_behavior#Examples_in_...](http://en.wikipedia.org/wiki/Undefined_behavior#Examples_in_C.2FC.2B.2B)

    
    
         Certain pointer operations may result in undefined behavior:
        int main()
        {
          int arr[4] = {0, 1, 2, 3};
          int * p = arr + 5; // undefined behavior
        }
    

In short, you can't assume that result will be placed after the array. What if
result is optimized to a register? It actually sounds quite likely to me given
the tight loop.

~~~
afhof
With C comes the pedantry

    
    
        int main(int argc, char ** argv)
        {
          int arr[4] = {0, 1, 2, 3};
          int * p = arr + 5; // valid 
          *p; // undefined behavior
          return 0;
        }

~~~
mian2zi3
If you're going to be pedantic, it is best to be right. I don't have a copy of
the C standard here, but I do have a copy of the C++ standard. From 5.7
Additive Operators (discussing an expression of the form P+N, where P has
pointer type and N has integral type):

"If both the pointer operand and the result point to elements of the same
array object, or one past the last element of the array object, the evaluation
shall not produce an overflow; otherwise, the behavior is undefined."

That is, if P and P+N don't both point to elements of the same array object,
or one past the end of the array object, the behavior is undefined. Thus,

    
    
      int arr[4] = {0, 1, 2, 3};
      ... arr + 4 ... // OK, points to one past the end of the array object
      ... arr + 5 ... // undefined
    

I'm sure the C standard is the same.

~~~
nitrogen
This is one instance where C++ and C are, indeed, the same. However, the
languages do have their differences, so I would caution those developers who
don't tend to read standards not to assume that C and C++ will behave
identically when given valid C code. Obviously, having referenced the C++
standard yourself, you are aware of such caveats.

6.5.6.8 from ISO/IEC 9899:201x draft of March 1, 2009:

 _...If both the pointer operand and the result point to elements of the same
array object, or one past the last element of the array object, the evaluation
shall not produce an overflow; otherwise, the behavior is undefined...._

------
drblast
Interesting article.

By any chance, does this sort of information exist at all for Java bytecode
compilers? Whenever I write Java code I try to do things the Java way but it
usually makes me cringe to think that something like accessing a boolean
field requires a function call.

I console myself by thinking that this all gets optimized away by the JIT
compiler, but I'd really like to know for sure.

------
tumult
Got me on the first and last ones. Cool post!

I think the fact that I got the first one wrong made me err too far into
thinking GCC was smarter than I had first assumed. And it is, but only for
certain things.

------
afhof
Concerning #2:

Does 'const' mean that the function is not allowed to change the contents of
the string, or does it mean that the contents of the string will never change?
Suppose the 'sum' function were running in one thread while, in another, the
string passed in as 's' was being continuously written to. While this may not
make much sense to do, and the output would not have a reliable value, this
optimization would be the wrong thing to do.

~~~
okmjuhb
const only means that the function is not allowed to change memory through
that reference. It says nothing about aliases to the same memory.

But the "what if another thread changes the string" issue is a red herring.
The question to keep in mind when optimizing in a multi-threaded context is
"does the output of the compiler generated code correspond to at least one
possible interleaving of threads". Since this thread doesn't do any
synchronization, it's possible that all the function executes without being
interrupted by another thread, so this is a sound optimization.
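
A sketch of the aliasing point (buf and observe are made up for
illustration): the write through buf can change the very bytes s points to,
so the compiler can't reuse the first strlen result.

        #include <string.h>

        char buf[] = "hello";

        size_t observe(const char *s)   /* call as observe(buf) */
        {
            size_t before = strlen(s);  /* const: no writes through s */
            buf[0] = '\0';              /* same memory changes via buf */
            return before + strlen(s);  /* must be re-evaluated here */
        }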

------
albertzeyer
`x+x` is faster than `x*2`? Really? Why? And how about `x<<1`?

~~~
mian2zi3
In a modern CPU, instruction selection is much more complicated than, "Oh,
addl (or sll) is faster than umul," and depends on the micro-
architecture/pipeline, surrounding code, etc. Addition usually takes one
cycle, while multiplication might take several cycles (although, by
pipelining, you can generally execute one multiply per cycle per multiplier).
But if all the adders are full and the result isn't needed for a few
cycles, it might be fastest to use the multiplier. (I'm thinking of a
statically scheduled pipeline. Things get harder when the CPU does dynamic
instruction reordering.)
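
Note that for the article's floating-point case (multiplication by 2.0f, as
discussed elsewhere in this thread), the rewrite is exact, so the compiler is
free to pick whichever form schedules better. A sketch:

        /* x * 2.0f equals x + x exactly in IEEE-754 arithmetic, so gcc
           may compile the multiply as an add (e.g. addss over mulss). */
        float twice(float x)
        {
            return x * 2.0f;
        }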

~~~
_delirium
Do any compilers attempt to take those sorts of concerns into account? It
seems like it'd be an interesting experiment, but it breaks some of the nice
abstractions: afaik, strength reduction is usually done before scheduling,
with the only architecture dependency being a table of costs assigned to each
operation. Letting scheduling constraints have input on those decisions seems
ideal but tricky, unless I'm missing something.

~~~
wwalker3
It would be very hard to accurately account for a microprocessor's out-of-
order instruction issue in a compiler.

A modern desktop-class microprocessor can have scores of instructions "in
flight" at once. Each new instruction that's decoded gets reordered or stalled
differently depending on the status of all of those previous instructions and
their hardware requirements.

A compiler could try to model the processor's microarchitecture and simulate
how it would execute the program, but that simulation would be inaccurate
because of data-dependent branching, interrupts, and other unpredictable
behaviors. Plus, the microarchitectural details you'd need to create such a
simulation are usually not available.

That said, compilers can still tailor code to the microarchitecture in a
general way. For example, if you know a processor can only decode one branch
instruction per cycle, you can try to reorder the compiled code to avoid back-
to-back branches in the instruction stream.

~~~
mian2zi3
> It would be very hard to accurately account for a microprocessor's out-of-
> order instruction issue in a compiler.

This is true. However, a lot of modern processors do in-order execution; they
just aren't made by Intel or AMD for desktops: GPUs, network processors, SIMD
DSPs, etc. When I was hacking compilers (6 years ago), compiling for these
processors presented difficult challenges and there was a real market for new
compiler technology. One project was for a network processor that was not only
in-order but also not interlocked: correctness depended on the compiler having
an exact model of the processor's pipeline. Some registers were not really
registers but latches: they got new values every cycle, so if you didn't grab
a value on the right cycle, it was gone forever.

------
mian2zi3
Nice post. This is how the Nullstone compiler optimization benchmark suite is
organized: two functionally equivalent versions of a C program, one
cleaner/suboptimal, the other optimized by hand. The goal is to have the
generated code perform the same in either case.

------
nitrogen
Wouldn't it be possible to optimize floating point multiplication by 2.0f (or
any power of two) to an addition or subtraction on the exponent?

~~~
astrange
Only on architectures with such instructions. x86 doesn't have anything like
that (well, maybe in x87), and I don't know of any others that do.

~~~
nitrogen
So would moving the floating point value into an integer register, shifting,
incrementing, ORing, etc. ever be faster than a full FP addition? I imagine
that modern desktop CPUs have effectively-single-cycle multiply, but what
about something like ARM with VFP but no Neon?

~~~
astrange
Only in architectures where FP and integer values are kept in the same
registers. This is true in SSE2 and Altivec, so you can do integer operations
there - but since SIMD integer operations are limited too, it's pretty much
only useful for flipping the sign bit.

Moving values between different register sets is INCREDIBLY slow since it
involves at least two memory operations.

And faking FP operations with specialized integer code on something with soft-
float like ARM might be worth it. I've never done it so I can't really say.
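
For completeness, here's a sketch (not from the thread) of the exponent trick
itself. It assumes IEEE-754 binary32 and a normal, nonzero input whose
exponent stays in range; zeros, denormals, infinities, and NaNs would all
need separate handling:

        #include <stdint.h>
        #include <string.h>

        float mul_pow2(float x, int k)       /* computes x * 2^k */
        {
            uint32_t bits;
            memcpy(&bits, &x, sizeof bits);  /* bit-copy avoids aliasing UB */
            bits += (uint32_t)k << 23;       /* exponent sits in bits 23..30 */
            memcpy(&x, &bits, sizeof bits);
            return x;
        }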

------
chipsy
I got almost all of it wrong. I'm not very familiar with gcc's workings and
usually just figure that the compiler is going to be dumb and fail at
everything but the simplest optimizations. But then, when I actually go to do
the tedious and source-obfuscating micro stuff I usually check my assumptions
first - you kind of have to anyway, once you go that deep.

~~~
JoachimSchipper
The point of most of the optimizations mentioned in the article is not to make
programs run faster; e.g. any competent programmer can hoist constant
functions out of a for loop himself, but knowing that the compiler will do
this allows you to write cleaner code.

(This does not apply to e.g. auto-vectorization, which replaces 'x[0] = y[0] *
z[0]; x[1] = y[1] * z[1]; ...' with a single (SSE?) instruction that
calculates all of that at once. Such instructions are much faster and would
require dropping down to assembly without a good optimizer.)
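
A sketch of the kind of loop the vectorizer handles (e.g. gcc at -O3;
restrict promises the arrays don't overlap):

        #include <stddef.h>

        void vec_mul(float *restrict x, const float *restrict y,
                     const float *restrict z, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                x[i] = y[i] * z[i];  /* becomes mulps, 4 floats at a time */
        }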

------
robryan
Great article. I only really have a basic knowledge of C from uni, but it was
easy to follow along and learn something.

------
pquerna
great post with a nice quiz format!

~~~
gwern
I wish more posts used quizzes; they make it that much harder to lie to
yourself.

------
Aegean
He could have added the switch statement optimization. That's my favorite: GCC
can optimize switch statements with sequential keys, e.g. (1, 2, 3, 4, ...),
into a jump table, as sketched below. AFAIK this was introduced in GCC 4.
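
A sketch of the kind of switch this means (values are illustrative):

        /* Dense, sequential case labels let gcc index a jump table
           instead of emitting a chain of compares. */
        int describe(int key)
        {
            switch (key) {
            case 1:  return 10;
            case 2:  return 20;
            case 3:  return 30;
            case 4:  return 40;
            default: return -1;
            }
        }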

~~~
ramchip
But he did: _The switch statement was optimized to a jump table, while the if
statements became a long sequence of compares._

------
igorgue
#2 was a mistake I made once, and a co-worker screamed at me that I was a
crappy programmer for it. Now that I know it's optimized, I feel like he owes
me an apology. Or maybe not :-)

~~~
yuvi
Were you using gcc?

I wouldn't be surprised if e.g. MSVC didn't perform that optimization since it
doesn't have the pure function attribute.

