

Feel the cache size: a definitive experiment - mojuba
http://melikyan.blogspot.com/2009/07/feel-cache-size-definitive-experiment.html

======
aminuit
Experiments ought to be repeatable. Here is a quick bash/perl script to
generate the parts he left out. It expects that you have copied his main
module to a file called main.c in the CWD.

    
    
        echo "Generating funcs.h"
        perl -e 'for ($i=0;$i<=1000000;$i++) {print "int f${i}(void);\n";}' > funcs.h
        echo "Generating funcs.c"
        perl -e 'print "#include \"funcs.h\"\n\n";for ($i=0;$i<=1000000;$i++) {print "int f${i}() {return ".($i+1)."; }\n";}' > funcs.c
        echo "Generating tab.h"
        perl -e 'print "#include \"funcs.h\"\n\n";print "int (*func_table[])(void) = {\n";for ($i = 0; $i <= 1000000; $i++) {print "f${i},\n";}print "};\n";' > tab.h
        echo "Compiling funcs.c"
        time gcc -O0 -c -o funcs.o funcs.c
        echo "Compiling main.c"
        time gcc -O0 -c main.c -o main.o
        echo "Linking..."
        time gcc -O0 main.o funcs.o -o func-time
        echo "Running..."
        time ./func-time
    

I'm using gcc 4.x, which seems to generate a slightly smaller function (10
bytes vs. 11):

    
    
        00000000 <f0>:
               0:       55                      push   %ebp
               1:       89 e5                   mov    %esp,%ebp
               3:       b8 01 00 00 00          mov    $0x1,%eax
               8:       5d                      pop    %ebp
               9:       c3                      ret    
    

I ran it on a 3GHz Xeon with a 4MB cache (so 2MB per core, I think) and got
roughly the same results, but with much shorter compile times.

    
    
        Running...
        code size: 10  time: 30 secs
        code size: 100  time: 48 secs
        code size: 1000  time: 49 secs
        code size: 10000  time: 47 secs
        code size: 100000  time: 51 secs
        code size: 1000000  time: 59 secs
        code size: 10000000  time: 202 secs

~~~
mojuba
First, on your assembly output: mine was a 64-bit system, and there's one
opcode that's different. Ok, doesn't matter.

Second, the fact that your timings don't differ much for sizes that fit in
the cache means your system probably lacks some kind of smarter instruction
pipelining which is, I presume, present on mine. To be honest, I have no idea
why there was such a notable difference in my case, i.e. from 12 to 45
seconds, while in yours it's 30 to 59.

------
comatose_kid
Isn't this 'definitive experiment' just showing that code which doesn't take
advantage of locality of reference is bound to be really slow if it is larger
than the cache size?

I mean, he's just executing trivial code from random locations in memory - not
a typical access pattern at all. Because of this, the experiment exaggerates
the effect of code size on performance. For example, if you had a one-million-
line program that didn't loop, its performance would not degrade nearly as
much as his results imply.
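
To make the access-pattern point concrete, here's a minimal sketch (not the
author's harness; it assumes the funcs.o and tab.h generated by the script
above, and a large RAND_MAX as with glibc): the sequential loop walks the code
roughly in the order it was laid out, while the random loop is likely to land
on a cold instruction-cache line on almost every call once the functions
outgrow the cache.

    
    
        #include <stdio.h>
        #include <stdlib.h>
        #include "tab.h"          /* defines func_table[], generated above */
        
        #define N     1000000
        #define CALLS 10000000
        
        int main(void) {
            long long sum = 0;
        
            /* Sequential: consecutive tiny functions share cache lines and
             * get prefetched, so the I-cache barely matters. */
            for (int i = 0; i < CALLS; i++)
                sum += func_table[i % N]();
        
            /* Random: each call jumps to an effectively random address in
             * ~10MB of code, so most calls miss the instruction cache. */
            for (int i = 0; i < CALLS; i++)
                sum += func_table[rand() % N]();
        
            printf("%lld\n", sum);
            return 0;
        }
    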

~~~
fauigerzigerk
Exactly. This experiment shows pretty much the opposite of what it aims to
prove. It shows that code bloat is completely irrelevant as long as locality
is taken into account in performance critical parts of the code.

The "valuable" lession we can learn from this is this: Never access a million
random memory locations in your innermost loop :-)

------
surki
if you want to know more on this topic

"What Every Programmer Should Know About Memory" -
<http://people.redhat.com/drepper/cpumemory.pdf>

~~~
terminus
A detailed (maybe too detailed) look at the interaction between memory
hierarchies and CPUs: Memory Systems: Cache, DRAM, Disk [
[http://www.amazon.com/Memory-Systems-Cache-DRAM-Disk/dp/0123...](http://www.amazon.com/Memory-Systems-Cache-DRAM-Disk/dp/0123797519) ]

------
PKeeble
The massive difference between random-access and sequential-access speeds is
the thing that has changed over the last 15 years. The gap between memory
clock speed and CPU clock speed has been widening for all that time, whereas
before that they were often similar.

The impact is that random access has suffered and the cache has become vital
in coping with the majority of those seemingly random jumps. Algorithms like
quicksort benefit from locality of reference and hence utilise the cache
better than algorithms that theoretically do less work.

It's interesting to see it jump like this, but it's not indicative of large
programs so much as of what happens when you randomly access memory and
effectively nullify your cache and your memory's DDR properties. Thankfully
most programs don't do this in practice; if they did, they would run about as
well as they did in the mid '90s on an original Pentium.
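
To see that gap in isolation (data rather than code this time), a minimal
sketch along these lines works; the array size and the crude random-index
mixing are arbitrary choices, picked only so the working set is much larger
than any cache:

    
    
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        
        #define N (32 * 1024 * 1024)   /* 32M ints = 128MB, far bigger than any cache */
        
        int main(void) {
            int *a = malloc((size_t)N * sizeof *a);
            size_t *idx = malloc((size_t)N * sizeof *idx);
            if (!a || !idx) return 1;
        
            for (size_t i = 0; i < N; i++) {
                a[i] = 1;
                idx[i] = ((size_t)rand() << 16 ^ (size_t)rand()) % N;  /* random index */
            }
        
            long sum = 0;
            clock_t t0 = clock();
            for (size_t i = 0; i < N; i++) sum += a[i];        /* sequential: prefetched */
            clock_t t1 = clock();
            for (size_t i = 0; i < N; i++) sum += a[idx[i]];   /* random: mostly cache misses */
            clock_t t2 = clock();
        
            printf("sequential %.2fs   random %.2fs   (sum %ld)\n",
                   (double)(t1 - t0) / CLOCKS_PER_SEC,
                   (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
            free(a);
            free(idx);
            return 0;
        }
    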

------
spitfire
This is something that's been bothering me for the last decade (okay, more).
Every time I've articulated it in the past I've been dismissed, but it really
is key to good performance.

IIRC, the GNOME folks once did some benchmarking and found they could get huge
speedups by trimming a few bytes here and there from the GNOME libraries.

I hope more people take the time to consider the implications of code bloat.
I'm not asking for hand-written assembler, but just a little bit of thought
here and there can produce huge improvements.

EDIT: This is also why SSDs are so exciting. Yes, they're not perfect yet, but
they promise to (all but) eliminate disk latency within the next few years.

~~~
rarestblog
Too true.

"We should forget about small efficiencies, say about 97% of the time" Donald
Knuth

We often forget about the rest 3% too.

~~~
lallysingh
It's a shame that profiling/performance/cache utilization analysis tools
aren't nearly as common/standardized as the rest of a developer's toolbox.

~~~
mojuba
Valgrind/cachegrind is fairly common in the UNIX/C world. Our company uses it
as part of the automated build/test run.
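
For what it's worth, a typical run against the test program built above would
look roughly like this (the cachegrind output file is named after the process
ID, so yours will differ): valgrind prints instruction- and data-cache miss
summaries, and cg_annotate breaks them down per function.

    
    
        valgrind --tool=cachegrind ./func-time
        cg_annotate cachegrind.out.<pid>
    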

~~~
lallysingh
Sadly they only support: X86/Linux, AMD64/Linux, PPC32/Linux, PPC64/Linux

Which doesn't help me :-(

------
jongraehl
This article is not really about the memory cache hierarchy.

It's an excuse to show off the author's exciting new algorithm for computing
the value of the integer "max": (func_table[max] - func_table[0]).

~~~
paulgb
Isn't it really calculating max * 11, since the functions pointed at by
func_table[max] are 11 bytes long?

I'm not much of a C coder; is there a more idiomatic way to do this?

~~~
mojuba
It could be max * 11, yes, but with optimizations turned on the compiler could
produce something that wouldn't be that easy to calculate in a more general
case. It's just that a file with one million functions is practically
impossible to compile with optimizations.

~~~
paulgb
Right. What I meant was that (func_table[max] - func_table[0]) is equivalent
to (in the example assembly given) max * 11.

When I asked if there was a more idiomatic way to do this, I meant more
idiomatic than the original code, not actually using (max*11) in the code.

~~~
mojuba
Not sure I understand your question, but all kinds of pointer juggling are, I
think, idiomatic in C.

~~~
jongraehl
Since not everyone got it: in C, the difference between two pointers of type
(T *) is the difference in (byte) addresses divided by sizeof(T).

In other words, pointer subtraction gives an index (of integral type
ptrdiff_t, which should be 64-bit if pointers are). For all p2 and p1 pointing
to the same type, p2 == &p1[p2-p1].
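
A tiny example of the defined case (both pointers into the same array):

    
    
        #include <stdio.h>
        #include <stddef.h>
        
        int main(void) {
            int a[10];
            int *p1 = &a[2], *p2 = &a[7];
        
            ptrdiff_t elems = p2 - p1;                  /* 5 elements */
            ptrdiff_t bytes = (char *)p2 - (char *)p1;  /* 5 * sizeof(int) bytes */
        
            printf("%td elements, %td bytes, identity holds: %d\n",
                   elems, bytes, p2 == &p1[elems]);
            return 0;
        }
    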

~~~
aminuit
Not exactly. The standard only defines pointer subtraction between two
pointers that point into the same array object. That's not what's going on
here. If he had written

    
    
        ((func_table+max) - (func_table+0))
    

then yes you would just get max. The difference is that (func_table[max] -
func_table[0]) is actually

    
    
        (*(func_table+max) - *(func_table+0))
    

which is subtraction of two arbitrary function pointers. It's not covered by
the standard. In this case GCC is just subtracting the two addresses, with no
division. His results (and mine above) bear this out. If you compile with
-pedantic you'll actually get a warning about it.
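
If you do want the byte distance, a cast at least spells out what's being
computed. This is a minimal sketch, and converting a function pointer to an
integer is still implementation-defined, but it makes the address arithmetic
explicit rather than relying on what GCC happens to do with raw function-
pointer subtraction:

    
    
        #include <stdio.h>
        #include <stdint.h>
        
        static int f0(void) { return 1; }
        static int f1(void) { return 2; }
        
        int main(void) {
            int (*tab[])(void) = { f0, f1 };
        
            /* Byte distance between the two entry points; sign and magnitude
             * depend on where the compiler places the functions. */
            long span = (long)((uintptr_t)tab[1] - (uintptr_t)tab[0]);
            printf("f1 starts %ld bytes from f0\n", span);
            return 0;
        }
    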

~~~
jongraehl
I'd noticed that I misread the original code and replied to my top-level post.

But that's an interesting point about the standard - perhaps it's intended to
allow sizeof(ptrdiff_t) to be smaller than sizeof(void *) on platforms where
no contiguous allocation would be allowed that is large enough to need the
larger type.

------
c00p3r
<http://people.redhat.com/drepper/cpumemory.pdf>

