
A cache miss is not a cache miss - ingve
http://larshagencpp.github.io/blog/2016/05/01/a-cache-miss-is-not-a-cache-miss
======
halomru
That's a nice demonstration. I would even go a bit further with the
conclusion/lesson: Cache misses are not evil.

When data isn't in a register and isn't in the L1 cache, it takes a lot of
time to fetch it from the other caches, or around 200 clock cycles if it comes
all the way from main memory. We measure that event as a cache miss. But modern
x86 processors will go to great lengths to execute the rest of the program
while waiting for the data to arrive. A cache miss only really slows the
program down if there aren't enough nearby independent instructions that can be
executed while waiting for the missing value to arrive.

You could likely write a program that triggers a cache miss every 30 clock
cycles but runs at the same speed as a program without cache misses. In a
different program, a cache miss every 30 clock cycles can mean a slowdown of
two orders of magnitude. Cache misses are only a useful metric for giving us an
idea of where to look, not for showing actual problems.
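
For example, a rough sketch of the contrast (my own, assuming a recent
out-of-order x86-64 CPU and a working set much larger than the caches): both
loops below miss the cache at a similar rate, but in the first the loads are
independent and the misses can overlap, while in the second each load depends
on the previous one, so the full latency is paid every time.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1 << 24;  // ~16M elements, far larger than any cache
        std::vector<std::uint32_t> perm(n);
        std::iota(perm.begin(), perm.end(), 0u);
        std::shuffle(perm.begin(), perm.end(), std::mt19937{42});

        std::vector<std::uint32_t> data(n, 1u);

        // 1) Independent loads: the CPU can keep many misses in flight at once.
        std::uint64_t sum1 = 0;
        for (std::size_t i = 0; i < n; ++i)
            sum1 += data[perm[i]];

        // 2) Dependent loads (index chasing): each access has to wait for the
        //    previous miss to resolve before the next address is even known.
        std::uint64_t sum2 = 0;
        std::uint32_t idx = 0;
        for (std::size_t i = 0; i < n; ++i) {
            idx = perm[idx];
            sum2 += data[idx];
        }

        std::printf("%llu %llu\n",
                    (unsigned long long)sum1, (unsigned long long)sum2);
        return 0;
    }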

~~~
onetimePete
There is only so much work to go around before cache misses turn into waiting
games, even with branch prediction and micro-op pipelining.

What is your suggested alternative metric?

~~~
wyager
Worth noting that modern processors aren't just pipelined; each core is
composed of multiple pipelined execution units operating in parallel. Using the
Tomasulo algorithm and its derivatives, modern processors can execute a ton of
instructions before earlier instructions in program order have completed. The
biggest issue is a branch that depends on a slow operation, which forces you to
throw out tons of work if you mispredict.
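
A rough sketch of that failure mode (my example, assuming random data and an
ordinary branch predictor): the branch below depends on a value that has to be
loaded first, and with random data it's mispredicted about half the time,
throwing away the speculative work each time. Sorting the data first makes the
same loop branch-predictable and typically much faster.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        std::vector<int> v(1 << 20);
        std::mt19937 rng{123};
        for (int& x : v) x = static_cast<int>(rng() % 256);

        // std::sort(v.begin(), v.end()); // same work, far fewer mispredictions

        std::uint64_t sum = 0;
        for (int x : v)
            if (x >= 128)                 // data-dependent, hard-to-predict branch
                sum += x;

        std::printf("%llu\n", (unsigned long long)sum);
        return 0;
    }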

~~~
audi100quattro
Could you define slow operations? I am trying to picture a CPU having to throw
out work when it is basically trying to get from one cache miss (failure) to
the next cache miss (failure) as fast as it can.

~~~
viraptor
I think what wyager meant is that on a mispredicted branch you have a CPU
pipeline already filled with data and partial results (scheduled loads,
read/write register conflicts, etc.). Now that we're at close to 30 stages per
instruction, flushing all those partial results because your `if` went a
different way than usual can get costly. In practice, after a branch
misprediction you'll have to wait at least [pipeline_stages] cycles until you
see the next instruction run. But if the scheduled instruction was something
complex from SSE / AES-NI / etc., you may have to wait even longer.

------
wmu
Good article. A few years ago I wrote a toy program which demonstrates cost of
cache misses (or the power of hardware prefetchers).
[https://github.com/WojciechMula/toys/tree/master/cache_test](https://github.com/WojciechMula/toys/tree/master/cache_test)

~~~
signa11
> toy program which demonstrates cost of cache misses

never realized 'or' is a keyword... :)

------
userbinator
It reminds me of this article, where adding prefetching to attempt to tell the
CPU where to go next for a linked-list operation actually had a slightly
detrimental effect on performance:
[https://lwn.net/Articles/444336/](https://lwn.net/Articles/444336/)
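
For reference, the pattern in question looks roughly like this (a sketch using
GCC/Clang's __builtin_prefetch rather than the kernel's prefetch() macro):
since the address of the next node only becomes known once the current node has
been loaded, the prefetch rarely gets far enough ahead to help, and the extra
instructions can make things slightly worse.

    struct Node {
        int value;
        Node* next;
    };

    long sum_list(const Node* head) {
        long sum = 0;
        for (const Node* n = head; n != nullptr; n = n->next) {
            if (n->next)
                __builtin_prefetch(n->next);  // hint: start fetching the next node
            sum += n->value;
        }
        return sum;
    }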

------
gumby
Cache impact of linked lists can be mitigated via cdr coding, originally
developed to save RAM:
[https://en.wikipedia.org/wiki/CDR_coding](https://en.wikipedia.org/wiki/CDR_coding)
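
Very roughly (a sketch of the idea, not a faithful Lisp implementation): each
cell can flag that its successor is simply the adjacent cell in memory, so a
mostly-contiguous list needs few real pointers and traversal turns into a
largely sequential, prefetch-friendly scan.

    struct Cell {
        int   value;
        bool  next_is_adjacent;  // CDR-coded: the next cell sits right after this one
        Cell* next;              // only consulted when next_is_adjacent is false
    };

    long sum_cdr_coded(const Cell* c) {
        long sum = 0;
        while (c != nullptr) {
            sum += c->value;
            c = c->next_is_adjacent ? c + 1 : c->next;  // follow adjacency or pointer
        }
        return sum;
    }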

------
kjaleshire
Very cool. Herb Sutter gave a talk [1] a couple of years ago in which he
advocated being mindful of the CPU prefetcher when dealing with large vectors,
especially vectors of pointers. Essentially, as long as memory is being
accessed in a predictable manner, the prefetcher can keep the processor fed.

As an aside: because instructions are accessed sequentially (branches aside), I
would intuit that data misses are much more common than instruction misses. If
this is incorrect, I'd be interested in an explanation or further reading.

[1]
[https://channel9.msdn.com/Events/Build/2014/2-661](https://channel9.msdn.com/Events/Build/2014/2-661)
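
To illustrate the vector-of-pointers point (my own sketch, not from the talk):
the first loop is a fixed-stride scan the prefetcher can follow, while the
second does one indirection per element to an unpredictable address.

    #include <memory>
    #include <vector>

    struct Particle { float x, y, z; };

    float sum_x_values(const std::vector<Particle>& ps) {
        float s = 0;
        for (const Particle& p : ps) s += p.x;  // contiguous, prefetcher-friendly
        return s;
    }

    float sum_x_pointers(const std::vector<std::unique_ptr<Particle>>& ps) {
        float s = 0;
        for (const auto& p : ps) s += p->x;     // one indirection (likely a miss) per element
        return s;
    }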

------
alain94040
If you want to measure cache miss effects, you really should rewrite this code
in C, so you know exactly what gets allocated in memory with what layout.

Without more details on data layout, you could be seeing the side effect of
aggressive pointer prefetching, which may be possible with one layout but not
another. Or there could be a fixed stride for some of the data, allowing
prefetchers to kick in. It's hard to tell, and it deserves some more
experiments to isolate what is going on.

~~~
chubot
The only type used in the demos is vector<pair<int,int>>. Vectors are
guaranteed to be contiguous in memory. I'm pretty sure pairs are too, but
that's easy to test:

    #include <cstdio>
    #include <utility>
    printf("%zu\n", sizeof(std::pair<int, int>));
    // prints: 8


So it's the obvious layout you would see in C with an array of struct { int x;
int y; }.

The cling interpreter is good for this kind of thing.

~~~
MaulingMonkey
20.2.2 Pairs ¶ 1 (ISO/IEC 14882:2003):

      The library provides a template for heterogeneous pairs of values. The library also provides a matching
      function template to simplify their construction.

          template <class T1, class T2>
          struct pair {
            typedef T1 first_type;
            typedef T2 second_type;

            T1 first;
            T2 second;
            pair();
            pair(const T1& x, const T2& y);
            template<class U, class V> pair(const pair<U, V> &p);
          };


tl;dr: It looks like the '03 standard _requires_ it to be equivalent to the
plain C version in terms of layout, although that could involve padding
between first and second if e.g. T1 = short and T2 = int, just like in the
equivalent C code.

------
netheril96
Is the comparison really fair? On a 64-bit system, the `std::unique_ptr<int>`
takes 8 bytes, while a linked list node takes at least 24 bytes (as it must
have a previous and a next pointer). The larger size means fewer values can
fit in the cache at any time, and that may be the real reason for the
slowdown.

~~~
larshagencpp
This is the reason that the problem was simplified to using
vector<pair<int,int>> for all the experiments, so no difference in data size.

------
ryandrake
This sets a good example: MEASURE things rather than just relying on the "old
wives' tales" of programming that everyone believes because everyone believes
them. So many times I've asked someone why they used some weird algorithm, why
they wrote their own linked list, or why they unrolled that loop, and it's
always "But it's faster!" Well, did you measure it to actually see if it's
faster?

One nit: please, please, stop perpetuating the use of the non-word
"performant." I cringe every time I hear it (now even in person at work!).
Using it just makes you sound dumb, and clearly the author is not dumb.

~~~
paulddraper
Don't read Shakespeare. He made up gobs of useful words.

~~~
sixo
Or committed a lot of the vernacular to print, when that was something that
wasn't done.

------
webaholic
I think this is the reasoning behind preferring SOA vs AOS.

~~~
userbinator
How are Service-Oriented Architectures and... Architecture-Oriented
Services(?) relevant to cache misses? There's more locality in the former,
because each service is more specific?

~~~
Kronopath
Structure of arrays vs. array of structures. I.e., having one struct that
contains several arrays as its members, vs. having one array that contains
several structs as its elements. The former keeps like data together, so when
you need to, for example, iterate over the position vectors of a bunch of
objects in a 3D graphics program, all the position vectors for all the objects
are tightly packed in the same array, increasing cache efficiency.
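
A small sketch of the two layouts (the names are made up):

    #include <vector>

    // Array of structures: one array whose elements are whole objects.
    struct ObjectAoS { float px, py, pz; float vx, vy, vz; int id; };
    using SceneAoS = std::vector<ObjectAoS>;

    // Structure of arrays: one struct whose members are per-field arrays.
    struct SceneSoA {
        std::vector<float> px, py, pz;
        std::vector<float> vx, vy, vz;
        std::vector<int>   id;
    };

    // Iterating positions only touches the tightly packed px/py/pz arrays in
    // the SoA version; the AoS version would drag velocities and ids through
    // the cache as well.
    float sum_px(const SceneSoA& s) {
        float sum = 0;
        for (float x : s.px) sum += x;
        return sum;
    }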

~~~
userbinator
I've always called those "parallel arrays" -- and they were around long
before Service Oriented Architectures:
[https://en.wikipedia.org/wiki/Parallel_array](https://en.wikipedia.org/wiki/Parallel_array)

------
onetimePete
The real trouble is the inability of chips to identify what Haskell could
completely avoid: running the same algorithm on the same input repeatedly.
Hashing over the input to skip the work and reuse the saved output, that would
be an optimization. Also, tools that give programmers data-oriented visual
feedback on the OO clutter costs they impart on their work.

A flag that guarantees side-effect freedom for a set of operations and
sub-operations, that the processor could rely on, would be great.
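
In software, the closest thing today is doing that hashing yourself; a loose
sketch for a pure, side-effect-free function (the names are mine):

    #include <unordered_map>

    // Toy stand-in for some expensive, side-effect-free computation.
    int expensive_pure_function(int x) {
        long long acc = x;
        for (int i = 0; i < 1000000; ++i) acc = (acc * 31 + 7) % 1000003;
        return static_cast<int>(acc);
    }

    int memoized(int x) {
        static std::unordered_map<int, int> cache;  // keyed by the input
        auto it = cache.find(x);
        if (it != cache.end()) return it->second;   // skip the work, reuse saved output
        int result = expensive_pure_function(x);
        cache.emplace(x, result);
        return result;
    }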

~~~
paulddraper
Hashing is expensive.

~~~
thrownaway2424
Not as expensive as a cache miss.

