

Twitter finds Ruby faster with gcc optimizing for size than speed - jim-greer
http://blog.evanweaver.com/articles/2009/09/24/ree/

======
arohner
Linux (the kernel) has been aware of this result for a long time. Here's a
thread from 2001, where Linus says

 _"I would, for example, suspect that a "correct" optimization strategy for
99% of all real-world cases (not benchmarks) is: if it doesn't have floating
point, optimize for the smallest size possible. Never do loop unrolling or
anything fancy like that. "_

<http://gcc.gnu.org/ml/gcc/2001-07/msg01543.html>

and a similar thread
[http://lkml.indiana.edu/hypermail/linux/kernel/0302.0/1068.h...](http://lkml.indiana.edu/hypermail/linux/kernel/0302.0/1068.html)

And the 2.6.15 (2005) changelog, which exposed a configuration option to
compile the kernel optimized for size: <http://lkml.org/lkml/2005/12/18/139>

~~~
rbanffy
Unrolling loops made sense when memory access was cheap, which is to say
until about the mid-80s. Since the advent of caches, making your code (and as
much data as possible) fit inside them has been the way to go.

------
mrshoe
_> This is an unusual phenomenon_

I don't think that's unusual at all. I've found it to often be the case. Cache
misses will _kill_ performance.

~~~
tetha
Yep, the most impressive thing I've seen along these lines was a coworker who
managed to squeeze the processing of (small) images entirely into the CPU
cache (while also improving the steps used to get the image there). The
speedup was a factor of around 15.

------
gecko
Apple's default optimization level is also for size, not speed, and that ends
up being faster there as well. They ran an entire session at WWDC a couple of
years ago explaining how and why -Os ends up being the superior choice:
basically, you blow out the cache less, which ends up mattering far more than
forcing the processor to do as much as possible on every cycle.

------
kmavm
This is an interesting source of "death-of-a-thousand-cuts" performance
problems in C++ codebases, as well. Code you never even execute can slow you
down; stack-unwinding code for unhandled exceptions takes up precious icache
bytes between the bodies of useful functions. Good luck finding this with a
profiler; every function executes epsilon slower, since a cache miss is that
much more likely to fetch its body.

~~~
barrkel
The stack unwinder should be a separate function altogether, not interspersed
with your code.

Activation record cleanup, on the other hand, will be inside your routines,
and will be executed on both normal and abnormal exit. But this isn't the
responsibility of exceptions; you have to do this cleanup even on normal
exit.

Code that runs only in exceptional cases is comparatively rare. But with
exceptions, you can move that code somewhere else entirely; and with PC-based
exception handling, the space cost is only borne when an exception is thrown,
when the PC lookup tables need to be paged in. In this scenario, the tradeoff
between exceptions and error codes becomes relevant: exceptions let your code
get smaller at the cost of a hit when an exception is actually thrown, while
error codes bloat your code.

Of course, all of the above is not specific to C++. C++ has other deficiencies
which can lead to suboptimal pathologies in practice.

------
gruseom
Can anyone mention (or point to) tips on what kind of code makes good use of
cache? I'm looking for interesting details. "Use less memory" is a bit
general.

~~~
vilya
This is far and away the most detailed explanation I've seen:
<http://people.redhat.com/drepper/cpumemory.pdf>

------
zck
On the graph, how can there be more replies per second (the y axis) than
requests (the x axis)? Look at the graph at x=80 (actually, it's slightly
under, maybe 79). There are up to 85 replies.

------
jherdman
Hm... I wonder when REE 1.8.7 will be made publicly available? My team is
looking to move to 1.8.7, and I'd rather like to hold off for the REE build
if possible.

~~~
acangiano
REE 1.8.7 should be released very soon. I tried a private pre-release and it
worked like a charm.

------
kierank
Surely icc would help more?

~~~
docmach
Why would icc help more?

~~~
Andys
Not only is gcc -Os faster for every benchmark I've ever tried, ICC is also
some 10% faster again than GCC at running the Ruby interpreter.

However, none of these speedups is as large as the gain from upgrading to
Ruby 1.9.

