
CPU Cache Essentials - strelok7
http://meta-x86.blogspot.com/2014/06/cpu-cache-essentials.html
======
js2
Seems obligatory to link to "Memory part 2: CPU caches" [1] from Ulrich
Drepper's "What every programmer should know about memory" [2].

1\. [http://lwn.net/Articles/252125/](http://lwn.net/Articles/252125/)

2\. [http://lwn.net/Articles/250967/](http://lwn.net/Articles/250967/)

------
astigsen
Direct link to the original presentation by Scott Meyers:
[http://vimeo.com/97337258](http://vimeo.com/97337258) (the article is just a
summary)

~~~
bainsfather
I tried watching the talk, but he goes _very_ slowly - could be compressed 3x
easily.

The summary gives you the content of the talk much quicker.

------
gioele
> use as much of the cache line as possible.

> For example, in the code below, whenever the else clause is taken we are
> throwing away a complete cache line that was fetched just by accessing the
> is_alive member
    
    
        struct Obj {
            bool is_alive;
            …
        };
    
        std::vector<Obj> objs;
    
        for (auto& o : objs) {
            if (o.is_alive)
                do_stuff(o);
            else {
                // just threw away a cache line
            }
        }
    
    

So, how do you fix that?

My first idea would be to remove the is_alive member from the objects and
create a separate is_alive array of booleans, one per object. But this opens a
whole new set of problems (how to map from objects to is_alive indexes, static
vs. dynamic sets of objects that can be alive, etc.).
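A minimal sketch of that idea (my own illustration, not from the thread): keep the flags in their own contiguous array, indexed in parallel with the object array, so the scan over flags stays dense in cache and only alive objects pull in their full cache line. `Obj`, `payload`, and `update_alive` are hypothetical names.

    
    
        #include <cassert>
        #include <cstddef>
        #include <vector>
    
        struct Obj {
            int payload = 0;  // stand-in for the real members
        };
    
        // Bump payload for every alive object; the flag scan touches far
        // fewer cache lines than scanning the (much larger) objects would.
        void update_alive(std::vector<Obj>& objs,
                          const std::vector<char>& is_alive) {
            for (std::size_t i = 0; i < objs.size(); ++i)
                if (is_alive[i])  // only now touch the object's cache line
                    objs[i].payload += 1;
        }
    
        int main() {
            std::vector<Obj> objs(8);
            std::vector<char> alive(objs.size(), 0);
            alive[2] = alive[5] = 1;
            update_alive(objs, alive);
            assert(objs[2].payload == 1 && objs[0].payload == 0);
            return 0;
        }

The mapping problem gioele mentions is real, though: this only works cleanly while objects keep stable indexes.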

~~~
ANTSANTS
If you really want to optimize data cache usage, use a bitvector and (either
through compiler intrinsics or inline assembly) your CPU's bit-scanning
instructions. This can reduce the storage requirement to 1 bit per object and
skip up to 32 or 64 not-alive objects in a single instruction.

[https://news.ycombinator.com/item?id=6176091](https://news.ycombinator.com/item?id=6176091)
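A minimal sketch of the bitvector idea (my own illustration, not from the linked thread), using C++20's std::countr_zero, which typically compiles down to a single bit-scan instruction (TZCNT/BSF); a whole word of 64 dead objects is skipped by one loop test:

    
    
        #include <bit>       // std::countr_zero (C++20)
        #include <cassert>
        #include <cstdint>
        #include <vector>
    
        // Collect indices of set bits, skipping runs of dead objects
        // 64 at a time.
        std::vector<int> alive_indices(const std::vector<std::uint64_t>& bits) {
            std::vector<int> out;
            for (std::size_t w = 0; w < bits.size(); ++w) {
                std::uint64_t word = bits[w];
                while (word) {
                    int bit = std::countr_zero(word);
                    out.push_back(static_cast<int>(w) * 64 + bit);
                    word &= word - 1;  // clear the lowest set bit
                }
            }
            return out;
        }
    
        int main() {
            // Objects 3, 64, and 70 alive; everything else skipped wordwise.
            std::vector<std::uint64_t> bits = {1ull << 3,
                                               (1ull << 0) | (1ull << 6)};
            assert(alive_indices(bits) == (std::vector<int>{3, 64, 70}));
            return 0;
        }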

~~~
devbug
As Arelius mentioned, it is more performant to just keep an array of alive
objects, since you're only touching memory that needs to be touched. That
said, bitvectors/bitsets are incredibly useful if you have no easy or clean
way of separating variations of objects.

An example would be "scene-graphs" in my game engine. Ostensibly they are
stored as components (though my game engine doesn't have components) which are
accessed by a unique identifier via an indirection table. Something like
Bitsquid.[0]

Like Bitsquid, my "scene-graphs" can be linked to other "scene-graphs." This
allows me to "glue" weapons to characters' hands or armour to their torsos. It
also means I have "scene-graphs" that can be updated (nodes repositioned) in
parallel (as they have no data dependencies) and others that cannot (linked
"scene-graphs"). For performance reasons, I identify which "scene-graphs" can
be updated in parallel, so I can kick them off to tasks that satisfy their
data dependencies. This is where the bitvector/bitset comes in: copying my
"scene-graphs" around and then adjusting the pointers of other "scene-graphs"
(those that point to the moved "scene-graph") when linking or unlinking
requires a complete iteration (once, if you batch) over all the "scene-
graphs," or some form of indirection (some with memory fragmentation too!).
Not to mention such an approach complicates the code for handling "scene-
graphs" a lot. Beyond that, I have to sort my "scene-graphs" (pointer list) by
link depth when updating, so I don't update children before their parents, and
the like. It's messy, but bitvectors/bitsets are a great solution.

I am experimenting with other methods to manage linked "scene-graphs" which
fall under a split (linked/unlinked) array method, but I don't have anything
yet, especially when I consider that I get a free iteration over all scene-
graphs anyways (which hides a lot of the cost of the current method).

[0]: [http://bitsquid.blogspot.ca/2011/09/managing-decoupling-part-4-id-lookup.html](http://bitsquid.blogspot.ca/2011/09/managing-decoupling-part-4-id-lookup.html)

Edit: And I completely forgot to mention the performance concerns of having
multiple cores touch the same memory: irregular stalls!

------
phserr
I see cache-related optimizations being discussed most often in the context of
C or C++. Being inexperienced in this area, I wonder how they affect
garbage-collected languages/systems. I understand that there are different
approaches to GC, but can someone recommend some discussion to read on this
topic? (cache optimization either in GC development itself or in development
with GC languages)

~~~
voidlogic
Are you concerned about garbage collectors that do relocation/compaction, or
what? Otherwise garbage collection doesn't really affect much of what is
discussed here (except of course if you are preempted by the GC running and it
evicts the cache lines that were useful to you, but this is rare). Maybe by GC
you mean the heap allocator (which of course is related but different)?

~~~
phserr
What I meant was just how having less direct control over memory (than you
have with C/C++) affects (edit: in a practical way) the efficiency of cache
usage. Having less control means being unable to pack things together as much
as one would want, but I'm sure sophisticated GC algorithms
(relocating/compacting/generational/incremental) directly or indirectly deal
with that somehow. As I said, I'm inexperienced with GCs and my understanding
of them is probably a bit naive.

~~~
voidlogic
Just so you are aware, you are presenting a false dichotomy. There are plenty
of languages that are both garbage collected and give you quite a bit of
control over memory layout (D, Go, Rust, etc).

For example, in Go, if I wanted a slice of *Foobar ([]*Foobar) but wanted the
Foobars allocated close to one another, I could write:

    
    
      const neededFoobars = 100
      // One contiguous allocation holds all the Foobars; the pointer
      // slice just indexes into it, so the values stay cache-friendly.
      backingSlice := make([]Foobar, neededFoobars)
      seqfoobarPtrs := make([]*Foobar, neededFoobars)
      for i := 0; i < neededFoobars; i++ {
          seqfoobarPtrs[i] = &backingSlice[i]
      }

~~~
dbaupp
(FWIW, Rust is not garbage collected.)

~~~
voidlogic
Rust has optional garbage collection

~~~
FreeFull
There is a library type that does reference counting. It's not built into the
language, beyond happening to be in the standard libraries.

------
gamebak
Well, when working with entities/objects you are not really going to get more
cache-efficient than an array, as seen in this example with array traversal.

Make sure to use simple arrays when you want to store some data. This is not
just for C; it will work well in other languages as well, including
JavaScript, PHP, Python, Perl, etc.

Because that's how processors' caching and prefetching algorithms are
designed.
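To illustrate the point (my own sketch, not from the comment): the same sum
over a contiguous array and over a pointer-chasing linked list gives the same
result, but the array layout lets the hardware prefetcher stream cache lines
ahead of the loop, while the list defeats it:

    
    
        #include <cassert>
        #include <list>
        #include <numeric>
        #include <vector>
    
        // Contiguous layout: one sequential run the prefetcher can stream.
        long sum_vector(const std::vector<int>& v) {
            return std::accumulate(v.begin(), v.end(), 0L);
        }
    
        // Node-based layout: each step is a dependent pointer load,
        // often a cache miss.
        long sum_list(const std::list<int>& l) {
            return std::accumulate(l.begin(), l.end(), 0L);
        }
    
        int main() {
            std::vector<int> v(1000);
            for (int i = 0; i < 1000; ++i) v[i] = i;
            std::list<int> l(v.begin(), v.end());
            // Same answer, very different cache behavior.
            assert(sum_vector(v) == sum_list(l));
            return 0;
        }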

