
When code is suspiciously fast: adventures in dead code elimination - signa11
http://jeffq.com/blog/when-code-is-suspiciously-fast-adventures-in-dead-code-elimination/
======
sp332
In the book Permutation City, rich people could upload their brains into
computers, but they were running very slowly. A billionaire loaded a copy of
his brain into a supercomputer and ran an optimizer on it. After a few years
it spit out an empty file and said, "This program produces no output". I liked
that book!

~~~
wyager
Greg Egan is one of the most mind-bending authors I have ever come across.
He's like Charlie Stross on crack. I highly recommend his short story
collection _Axiomatic_.

------
Athas
My experience with benchmarking is that the benchmark program must actually
_do_ something with the result (like, printing it). Not only because you
cannot trust a benchmark to do self-validation, but also because you otherwise
end up fighting the optimiser.

You don't have to print the result within the body of code that you are timing
(which would skew the result), but you must have some confidence that the
benchmarking setup is sufficiently close to real-world usage conditions. This
usually involves the computation not knowing how its result is going to be
used.

(In a similar vein, I am also very skeptical about benchmark programs that
hard-code their input, rather than read it at runtime.)

~~~
Ono-Sendai
You can just print the result after you have finished measuring the loop.

~~~
mason55
Unless you're using a language with lazy evaluation

~~~
tomsmeding
In which case there is probably some sequencing function you can use (seq or
deepseq in Haskell, for example).


------
sudhirj
Isn't it normal for compilers to just no-op functions that can be proved to
have no side effects and no return values?

~~~
cbd1984
> Isn't it normal for compilers to just no-op functions that can be proved to
> have no side effects and no return values?

Yes, which is why the volatile qualifier exists: It makes modifying specific
variables a side-effect, which then cannot be optimized away.

~~~
friendzis
No, no, no.

> 5.1.2.3 ¶2: Accessing a volatile object <...> are all side effects.

Yes, modifying is a form of access, but so is a _read_. Which means one can
READ a hardware register in a loop and expect it to change, without the
compiler optimizing that out.

~~~
cbd1984
Nothing I said was wrong.

------
TeMPOraL
Well, the lesson here is to not use asserts for that task. Asserts are by
definition checks that are meant not to be included in release versions. If
checking some invariant matters to you in the actual application (as opposed
to just being a debugging convenience), then do a normal check.

~~~
moonshinefe
Well tell that to Blizzard and all the thousands of other developers where
asserts are used in production code, I guess.

~~~
TeMPOraL
I said "release versions", not "production _code_ ".

Please check the very definition of assert() macro.

~~~
moonshinefe
Okay but I'm pretty sure the manual pages for assert functions don't make a
distinction between release versions and production code, whatever that is...

~~~
TeMPOraL
Let me pull out the relevant fragments of /usr/include/assert.h on my
machine:

        #ifdef	NDEBUG
        # define assert(expr)		(__ASSERT_VOID_CAST (0))
        #else /* Not NDEBUG.  */
        # define assert(expr)							\
          ((expr)								\
           ? __ASSERT_VOID_CAST (0)						\
           : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION))
        #endif /* NDEBUG.  */

Do I really need to explain this more?

~~~
acdha
That's the code, not what moonshinefe was talking about, and we know that most
programmers rarely look at the source for standard library code unless they're
debugging something.

Here's what the assert man page says on OS X:

         The assert() macro tests the given expression and if it is false, the calling process is terminated.  A
         diagnostic message is written to stderr and the abort(3) function is called, effectively terminating
         the program.
    
         If expression is true, the assert() macro does nothing.
    
         The assert() macro may be removed at compile time with the cc(1) option -DNDEBUG.
    

Note that the “may be removed” warning is at the end, where it's easy to miss
and it mentions a specific compiler flag rather than the concept of a release
build.

On Linux, the man page leads with that discussion:

           If  the  macro  NDEBUG was defined at the moment <assert.h> was last included, the macro assert() generates no
           code, and hence does nothing at all.  Otherwise, the macro assert() prints an error message to standard  error
           and terminates the program by calling abort(3) if expression is false (i.e., compares equal to zero).
    
           The  purpose  of this macro is to help the programmer find bugs in his program.  The message "assertion failed
           in file foo.c, function do_bar(), line 1287" is of no help at all to a user.
    

You're not wrong but you're being very optimistic to assume that a) every C
programmer, including those grinding away on massive projects, has learned
about this and knows every compiler flag in use and preprocessor macro defined
in every file and b) large applications haven't accreted so much code in
assert calls that the current maintainers are afraid to turn it off.
Stories like this one pop up every so often in the programming community for a
reason.

------
jiiam
Very cool. I believe this is a very instructive piece of code for introducing,
in a concrete way, what a compiler does to optimize your code, at least to
inexperienced people like myself.

I'm amazed by the simplicity of this example. Are there more serious
optimizations that can be seen with similarly simple code?

~~~
rkangel
Try this:

[http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array)

Have a look at the "Update" to the top answer, specifically what the intel
compiler does.

~~~
bathory
It should be noted, though, that the actual branch prediction comes from the
CPU, not the compiler.

------
satyapr93
Been there, done that. I was comparing clang and GCC. Clang had dead coffee
elimination but not GCC. I was surprised by seeing the speedup by using clang
:P

~~~
userbinator
"dead coffee elimination" sounds more like something for Java code. ;-)

~~~
carlob
Or maybe OP was just thinking: "God I really need to wake up, let's make some
coffee" while typing.

~~~
fourthark
Coffee in, coffee out.

------
adrianN
I wonder why the iterative version isn't optimized away as well. Is it not
inlined because the loop is deemed sufficiently expensive?

~~~
Ono-Sendai
Good question. It's probably not due to the loop being classed as expensive or
not though.

------
stcredzero
Extreme Programming sprang from the Chrysler C3 payroll software project in
Smalltalk. Kent Beck was originally brought in as a consultant to optimize the
software. It was taking 3 days to run a daily batch cycle. He asked the
Chrysler folks if there were datasets and validated results for him to use.
They answered, "The system's not producing correct results yet." His response,
"In that case, I can make it _real fast!_ "

------
lordnacho
It's always good to take a look at the output binary. There are also other
things lurking that you may be interested in, for instance security snippets
that you might not need.

I've run into this problem several times. Something like a for loop that
doesn't do anything but is supposed to eat up time will easily be eliminated,
making your attempts at measuring the time taken a bit meaningless.

------
ndesaulniers
Don't use assert! Most "release modes" (I was aware of CMake's, but it looks
like VS does this as well) entirely remove assert statements when NDEBUG is
defined! Thus you're doing a calculation and not using the result, so it looks
to the compiler like the calculation can be skipped if there are no other side
effects.

------
kazinator
The problem here, I suspect, is that this std::whatever::now() clock is
somehow being treated by the compiler as if it requires no sequencing with
regard to function calls.

It's probably not that Microsoft's compiler is optimizing through the function
pointer, but rather that it thinks that now() doesn't have to be sequenced.
And its return value is just assigned to local variables, which have no
interaction with the function being called. So if now() doesn't have to be
sequenced, the initializations of those variables can be reordered.

This is definitely a bug somewhere between how this now() function is
implemented and how the compiler treats its definition or declaration. A
clock-sampling function simply must be properly sequenced with regard to
surrounding code.

------
Johnny_Brahms
This was the best thing from my time at uni. We wrote a compiler, and I made
one that, when it met some types of code, would insert highly optimised C. We
had a small competition at the end of the year with lots of small stupid
benchmarks.

It was self-hosting, and I made it insert the funky code on compile. I also
made it do precomputation on most loops.

The benchmarks at the end of the year had a very uneven time distribution, and
everybody knew that we had done something funny (especially since our
recursive slow Fibonacci was the fastest by a factor of 8).

We knew we would get caught, but it was funny anyway. Someone else won because
their compiler was actually good, but we got a new special "underhanded
sneakiness" prize.

------
pcwalton
Function pointers are not barriers to inlining in any modern compiler
optimization pipeline. In fact, function pointer inlining simply falls out of
constant propagation.

------
lmm
Even when code isn't eliminated entirely, microbenchmarking is very hard.
Compare [http://shipilev.net/blog/2014/java-scala-divided-we-fail/](http://shipilev.net/blog/2014/java-scala-divided-we-fail/)

------
jlas
The article doesn't mention the big difference between the recursive and
iterative Fibonacci solutions: their respective O(2^n) and O(n) time
complexities.

