

GCC optimization flag makes your 64-bit binary fatter and slower - mudgemeister
http://timetobleed.com/gcc-optimization-flag-makes-your-64bit-binary-fatter-and-slower/

======
alexgartrell
To put this in context, this particular bug makes your binary fatter and
slower in the same way that eating a tic tac makes you fatter and slower. A
single load/restore from memory through a slightly (and only slightly) slower
path is really going to be blown away by every other operation in all but the
most trivial program (imagine LOTS of recursion with _very_ simple functions
(no looping)). Do a single IO operation and you're really talking about
nothing.

~~~
ars
Except that this is intended to make things faster, but actually makes things
slower.

If it had some other purpose, and a slight slowness was a side effect, then
OK. But when the entire purpose is reversed that's a problem.

~~~
alexgartrell
I agree that it's a bug and should be fixed. I just don't agree with the
implication of the title that it's a super huge deal. I mean, all we're doing
is swapping out mov for push and leave. The functionality is completely
unaffected, and the slowdown is so minor that no one would ever notice without
running a ridiculous microbenchmark.

Fix the bug, but don't make a mountain out of a mole hill.

~~~
ars
Where does it say anything about functionality? It just says fatter and
slower, and that's exactly correct.

I don't see any mountain either. Are you reading some emotional context I'm
missing?

------
phsr
It's nice to see someone tell you that their benchmark is flawed and why. Most
try to pass off their benchmark as the be-all and end-all measurement of
whatever they are testing.

~~~
JoeAltmaier
Graphs were also flawed - scale grossly exaggerated to make the point. Why not
zero-based vertical scale? See "How to Lie with Statistics".

~~~
masklinn
If the scales were flawed, it's in defense of -fomit-frame-pointer: according
to his bench, it reduces the cycle count by ~0.3% in 32b, but blows it up by
nearly 30% (29.7) in 64b.

I have put both graphs on the same image and scaled them correctly, then
extended the canvas to the whole scale and finally shrunk the graph back to
the original 600px height. This is the result: <http://imgur.com/yG3R9>

64b on the left, 32 on the right, blue is without -fomit and red is with it.
On the 32b graph, you can barely discriminate between with and without,
whereas on the 64b graph you can very clearly see it.

If he lied, his fault is to have dismissed his own findings as less important
than they are.
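
For anyone checking the arithmetic, the percentages here are plain relative
change against the run without the flag. A minimal sketch, assuming you have
two mean cycle counts in hand; `percent_change` is a made-up helper name, and
the numbers in the usage note are hypothetical values picked to land on the
quoted percentages, not the article's measurements:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Positive means the count went up (slower), negative means it went down. */
double percent_change(double before, double after)
{
    /* A zero or negative baseline makes no sense for a cycle count. */
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

With hypothetical counts, `percent_change(1000.0, 1297.0)` comes out at about
29.7 and `percent_change(1000.0, 997.0)` at about -0.3, which is the shape of
the 64b and 32b results described above.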

~~~
ice799
thanks man. i've also updated the graphs on my blog.

~~~
masklinn
Still a small issue: the graphs aren't the same scale (64b goes 0-5; 32b goes
0-4)

------
froydnj
A couple of points:

- It's hard to reproduce the benchmarking results, as source code for the
benchmarks is not provided.

- The original bug report was for 32-bit code, not 64-bit code, as the post
assumed throughout.

- If you compile the code given in the original bug report as 64-bit code
with and without -fomit-frame-pointer, there's no difference in the generated
code.

- It's not clear to me that the "potential pieces of code" in the article are
actually generatable with real-world C code. Again, not having actual source
code available hurts.

- You shouldn't be using -fomit-frame-pointer on 64-bit code anyway, as you
don't need the frame pointer for debugging/unwinding purposes on x86-64 like
you do on x86. If the poster had read the x86-64 ABI, this would have been
apparent.

~~~
ice799
Chill, son.

1.) I can provide the codez. I'll add a link to the article.

2.) Yeah. If you read the article, it reproduces on 64bit code, too.

3.) Not true. Try gcc (Debian 4.3.2-1.1) 4.3.2, the version I mentioned using
in my post.

4.) etc

5.) Read the article.

I'll add some more shit and reply to you again when it's online. I need to eat
breakfast and head to the office but I'll make you happy soon.

~~~
masklinn
Maybe you should provide both graphs on a full scale (from 0 to 4.8) to show
just how little, in your benchmark, -fomit-frame-pointer brings to the table
in 32b (under half a percent using your mean cycles count) versus how much is
lost due to it in 64b (nearly +30% cycles)

~~~
ice799
done, refresh page and you should see em.

------
dfj225
Anyone doing timings in Linux at the cycle level might be interested in the
PAPI library: <http://icl.cs.utk.edu/papi/>

------
stcredzero
It sometimes pays to run your tests at different levels of optimization. There
are optimizer and code generation bugs in compilers. Running your tests at
different levels of optimization can detect some of these.

~~~
pmjordan
_There are optimizer and code generation bugs in compilers._

In my experience these are much, much rarer than the bugs in _my_ code that
are optimisation level dependent...
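
One common example of that kind of bug, as a hedged sketch: an overflow check
that itself relies on signed overflow. The function names below are made up
for illustration. The buggy form appears only in a comment because it is
undefined behavior, so a compiler may treat it differently at different
optimization levels; the code shows the well-defined version.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Buggy pattern, shown only as a comment:
 *
 *     return x + 1 > x;
 *
 * Signed overflow is undefined behavior in C, so an optimizing compiler is
 * allowed to assume it never happens and fold the test to "true". The check
 * can appear to work at -O0 and then stop working at -O2.
 *
 * Well-defined replacement: compare against the limit before adding. */
bool can_increment(int x)
{
    return x < INT_MAX;
}

/* Adds one, failing loudly in debug builds if that would overflow. */
int checked_increment(int x)
{
    assert(can_increment(x));
    return x + 1;
}
```

Building and running the same tests at several -O levels is one way a check
like the commented-out one gets caught.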

~~~
stcredzero
Then there's a whole lot more reasons to test with different optimization
levels.

~~~
pmjordan
Definitely. Even better, have them run inside valgrind or a similar tool, as
this type of bug is often laughably easy to locate that way, and sometimes
devilishly difficult to find with more crude methods.

~~~
stcredzero
When I was doing C/C++ we didn't have valgrind, or at least it wasn't
widespread enough for me to hear about it.

------
oliveoil
uhh, what is that disgusting thing on the picture near the top?

~~~
grandalf
I believe it's a ferocious water bug:

http://www.google.com/images?um=1&hl=en&tbs=isch:1&#...

~~~
ars
It's a Giant water bug. The ferocious thing is a joke.

------
aliguori
The analysis is flawed because it only takes into account the fact that ebp
is callee-saved versus using a different register that is caller-saved. But
ultimately, the register still needs to be spilled somewhere, so you're not
changing the overall amount of work done.

Furthermore, it doesn't take into account the fact that by using ebp as a GP
register, you've got an additional register to work with which eliminates
additional spilling.

Basically, there's nothing to see here.

~~~
ice799
Spilling has nothing to do with this. GCC has the chance to pick another GPR
that is caller saved but instead picks a callee saved register.

This decision increases the size of libc by 1% when compiled with
-fomit-frame-pointer[1].

[1] <http://sources.redhat.com/ml/libc-alpha/2010-07/msg00022.html>
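
To look at the register choice being argued about here, a small test case
along these lines could be compiled both ways and the assembly compared. This
is only a sketch: `helper` and `keep_live_across_call` are made-up names, not
code from the article or the bug report, and `__attribute__((noinline))` is a
GCC/Clang extension. Whether a given GCC version keeps the live value in a
callee-saved register, in rbp, or on the stack is the thing to inspect, not
something this snippet guarantees.

```c
/* frame.c: hypothetical test case for comparing generated code.
 *
 * Compare the assembly produced by, for example:
 *   gcc -O2 -S -fno-omit-frame-pointer frame.c -o with_fp.s
 *   gcc -O2 -S -fomit-frame-pointer    frame.c -o without_fp.s
 */

/* Kept out of line so the caller below really contains a call. */
__attribute__((noinline)) int helper(int x)
{
    return x * 3 + 1;
}

/* `a` is needed after the call returns, so the compiler has to keep it
 * somewhere that survives the call: a callee-saved register (which this
 * function must then save and restore itself) or a stack slot. */
int keep_live_across_call(int a)
{
    int r = helper(a);
    return r + a;
}
```

Diffing the two `.s` files shows which register holds `a` across the call and
what save/restore instructions the function gains or loses with the flag.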

