
Speed memory access by arranging data to take advantage of CPU caching - jervisfm
http://gameprogrammingpatterns.com/data-locality.html#11
======
modeless
I've recently been using [http://halide-lang.org/](http://halide-lang.org/), a
language designed to separate the specification of an algorithm from the order
of the computations and the layout of the data in memory. It allows you to
first get an algorithm working and then quickly try a huge number of
permutations of the options for data layout, parallelism, and vectorization.
You can get incredible speedups vs. a naive series of C for loops.
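
For anyone curious what that separation looks like, here is a minimal sketch in Halide's C++ front end (the computation is a placeholder, the tile/vector sizes are arbitrary, and the realize() call style assumes a recent Halide):

    #include "Halide.h"
    using namespace Halide;

    int main() {
        Var x("x"), y("y"), xi("xi"), yi("yi");
        Func f("f");

        // The algorithm: *what* to compute (placeholder expression).
        f(x, y) = (x + y) / 2;

        // The schedule: *how* to compute it. Changing only these lines
        // re-tiles, re-vectorizes, or re-parallelizes the same algorithm.
        f.tile(x, y, xi, yi, 64, 64)
         .vectorize(xi, 8)
         .parallel(y);

        Buffer<int> out = f.realize({1024, 1024});
        return 0;
    }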

~~~
zwegner
That looks pretty sweet, thanks for the link.

This is where I feel functional languages will have a huge advantage in the coming years. As side effects are reified and brought under the compiler's control, they can be modified and optimized automatically.

While TFA is interesting, and a pretty good overview of designing programs with the cache in mind, it just shows me how far we have to go in language design. I don't want to make all of these modifications by hand, let alone carefully profile every permutation of all the possible choices. We have fast, dumb computers; let them do that work.

~~~
skybrian
Sure, this is the myth of the sufficiently smart compiler [1]. The problem is
that such compilers generate fast code until you hit an edge case and then
suddenly they're slow. Today, JavaScript is the canonical example; you never
know how fast code will run without benchmarking, and it changes with each
browser release. Or take SQL for another extreme example, where performance
depends on the index you hit and the query plan the database comes up with.

To counter this, people invent things like the asm.js specification, which
tells you exactly what you need to do to make JavaScript fast, and then we're
programming in assembly again. (Or perhaps C++.)

So there is still plenty of room for languages that aim for predictable
performance. I believe this is where Go and Rust are headed.

[1] [http://prog21.dadgum.com/40.html](http://prog21.dadgum.com/40.html)

~~~
zurn
Actually the problem is that compilers have dropped the ball on optimizing
data representation. They generally don't even try.

Some faint glimpses of light can be seen, like compressed references in
HotSpot or this Halide.

Rust and Go have nothing planned in this area.

~~~
kristianp
Not Go or Rust, but I'm curious: in OCaml or Haskell, I imagine it is not easy to change the data representation without refactoring, because of the widespread pattern matching? E.g. in OCaml: let (x,y) = point;;

~~~
emillon
That's one of the reasons why record types are recommended instead of tuples.
They don't cost anything at runtime but make it harder to shoot yourself in
the foot.

------
alexhutcheson
If you are interested in specific data on the effect caching has on memory bandwidth, you might want to check out this paper I wrote back in 2011 [1].

Some of my findings:

- Effective bandwidth is approximately 6x greater than main-memory bandwidth for data that fits in L1, 4.3x greater if it fits in L2, and 2.9x greater if it fits in L3.

- Contention for the shared L3 cache can limit the speedup you get from parallelization. For instance, running two threads on a data set that fits in L3 yields a speedup of only 1.75x, rather than 2x. Four threads on one four-core processor yield a speedup of only 2x over the single-threaded program.

- It takes relatively few operations for a program to become compute-bound rather than memory-bound. If 8 or more "add" operations were performed per data access, we found that the effects of caching disappeared almost completely, and execution was limited by the processor rather than by the memory bottleneck.

The specific magnitude of these results is machine-dependent, but I would
expect the general relationships to hold for other machines with a similar
cache hierarchy.
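
A minimal sketch of that kind of measurement (the buffer sizes, streaming kernel, and traffic total are my choices, not the paper's exact setup); the knees in the bandwidth it prints line up with the cache sizes:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t total = 1ull << 30;  // ~1 GiB of reads per working-set size
        for (size_t bytes = 16 << 10; bytes <= (64u << 20); bytes *= 4) {
            std::vector<double> a(bytes / sizeof(double), 1.0);
            size_t reps = total / bytes;
            double sum = 0.0;
            auto t0 = std::chrono::steady_clock::now();
            for (size_t r = 0; r < reps; ++r)
                for (double v : a) sum += v;  // one sequential read per element
            auto t1 = std::chrono::steady_clock::now();
            double secs = std::chrono::duration<double>(t1 - t0).count();
            std::printf("%8zu KiB: %6.2f GB/s (sum=%g)\n",
                        bytes >> 10, total / secs / 1e9, sum);
        }
    }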

[1] [http://www.stoneridgetechnology.com/uploads/file/ComputevsMemory.pdf](http://www.stoneridgetechnology.com/uploads/file/ComputevsMemory.pdf)

------
Tiktaalik
The first I heard of this sort of thing was DICE's presentation on "Data Oriented Design": [http://dice.se/publications/introduction-to-data-oriented-design/](http://dice.se/publications/introduction-to-data-oriented-design/)

I think they'd differ from the author here: this isn't an optimization to reach for when performance starts to suffer, but a design you have to adopt from the beginning, because refactoring everything into this style later is a big change.

~~~
AngusMcQuarrie
There was also a talk on this from Mike Acton (Engine Director at Insomniac) at GDC this year. He echoed the same sentiment: that cache misses are the main source of performance problems in software. I only partially agree, since network calls are several orders of magnitude slower, and in many services those are the actual bottleneck, not cache misses.

------
assholesRppl2
The best cache-aware programming lesson I ever received was in the CS61C course at Berkeley -- writing a cache-blocked matrix multiplication that used the cache as efficiently as possible. We blocked the loops so that each iteration's working set was exactly the size of one cache block, and instantly saw the FLOPS increase.
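
Something like this, for the curious -- a sketch from memory, not the actual assignment (the block size is illustrative, and N is assumed divisible by BLOCK):

    // Blocked matrix multiply: C += A * B, all N x N, row-major.
    // BLOCK is tuned so three BLOCK x BLOCK tiles fit in cache together.
    const int N = 512, BLOCK = 64;

    void matmul_blocked(const float* A, const float* B, float* C) {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int jj = 0; jj < N; jj += BLOCK)
                    // The inner loops stay inside one tile of each matrix,
                    // so nearly every access after the first is a cache hit.
                    for (int i = ii; i < ii + BLOCK; ++i)
                        for (int k = kk; k < kk + BLOCK; ++k) {
                            float a = A[i * N + k];
                            for (int j = jj; j < jj + BLOCK; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }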

Then we did some OpenMP parallelization. That was cool.

Nice post!

~~~
wting
I had a similar project in a programming-for-performance class at UT Austin. We had access to TACC supercomputers and were tasked with finding the CPU's L1 cache size by trial and error: we ran matrix multiplications, measured the performance, and changed the block sizes accordingly.

------
sriram_malhar
Martin Thompson has a blog and a series of talks on "Mechanical Sympathy" -- being in tune with the machine to extract performance, like a race car driver.

"How to get 100K TPS with 1ms latency":
[http://www.infoq.com/presentations/LMAX](http://www.infoq.com/presentations/LMAX)

Blog: [http://mechanical-sympathy.blogspot.in](http://mechanical-sympathy.blogspot.in)

------
hadoukenio
If you're interested in this type of thing, I highly recommend "Code
Optimization: Effective Memory Usage" by Kris Kaspersky:

[http://www.amazon.com/Code-Optimization-Effective-Memory-Usage/dp/1931769249/](http://www.amazon.com/Code-Optimization-Effective-Memory-Usage/dp/1931769249/)

------
rdtsc
This is a very good article on the topic.

I hope universities catch up and build this, along with distributed systems and networking, into their core curricula.

Mix multi-threaded programming with cache effects and there is a whole new world of side effects and unexpected results. There is a side note about it in the article, but I think it deserves a whole article of its own.
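
The classic example of that mix is false sharing; a minimal sketch (the struct names and iteration counts are invented for illustration):

    #include <atomic>
    #include <thread>

    // Two counters in the same cache line "falsely share" it: every write
    // by one core invalidates the line in the other core's cache.
    struct Packed { std::atomic<long> a{0}, b{0}; };

    // Giving each counter its own 64-byte line removes the ping-pong; on
    // typical x86 hardware this version runs several times faster.
    struct Padded {
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <class Counters>
    void hammer(Counters& c) {
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a++; });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b++; });
        t1.join();
        t2.join();
    }

    int main() {
        Packed p; hammer(p);  // slow: both cores fight over one line
        Padded q; hammer(q);  // fast: each core owns its own line
    }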

It is interesting that this plays somewhat into the favor of functional languages, which are often more data-centric than code-centric: you lay out your data and then pass it through your functions, as opposed to organizing functionality into objects that each happen to encapsulate a piece of data (which scatters and breaks up the data).

~~~
rwmj
I studied this at university in 1996. I hope university courses haven't
"forgotten" this sort of thing since then.

------
mortenlarsen
For those on GNU/Linux who want to know about the CPU caches on their machine, the "lscpu" command gives more detailed cache info than /proc/cpuinfo.

I got curious about how lscpu was determining the L1 data and L1 instruction caches. So I had a look at the source code of lscpu and found that it was looking in /sys/devices/system/cpu/cpu*/cache/ and /sys/devices/system/cpu/cpu*/topology/
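
Those sysfs entries are plain text files, so you can also read them directly; a small sketch (paths as on a typical modern kernel):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Each cpu0/cache/indexN directory describes one cache
        // (L1d, L1i, L2, ...); stop at the first missing index.
        for (int i = 0; ; ++i) {
            std::string dir =
                "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i) + "/";
            std::ifstream level(dir + "level"), type(dir + "type"), size(dir + "size");
            std::string l, t, s;
            if (!(level >> l) || !(type >> t) || !(size >> s)) break;
            std::cout << "L" << l << " " << t << ": " << s << "\n";
        }
    }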

This led me to "Memory part 4: NUMA support": [http://lwn.net/Articles/254445/](http://lwn.net/Articles/254445/)

Pretty interesting stuff.

------
nawb
So this is all cool, and it's incredibly nifty to have hardware insight while writing software. But these optimizations (loop tiling, loop fusion, etc.) could be (and, to my knowledge, are) part of standard gcc and Java compilers. Why are they not used more often? Why do we have to provide a specific flag to say "Hey, by the way, I almost forgot, make my program 50 times faster"?

I'm slightly in the dark as to why loop optimizations are not part of the default compile process.

~~~
RogerL
Optimization takes serious amounts of time - it's generally a combinatorial problem. If you are not building a production release, do you want to sit for 5 minutes (or 5 hours, for a big system) while the compiler cranks away optimizing code you are just going to rebuild in 20 minutes anyway? Plus, for a lot of production code it simply doesn't matter: if the code is not in a hot path, making one block 10x faster while some other loop accounts for 50% of the run time gains you almost nothing. So heavy optimization tends to be something you turn on selectively and judiciously.

~~~
nawb
That's a good point. Thanks!

------
jaredlwong
Alignment is another key point the author didn't mention. Cache-aligning data is seriously important, even if it means padding structures like crazy. Cache lines are pretty big, but having to load two different lines and do some bit trickery to stitch a value together is a real killer.
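
For instance, a sketch of line-aligned padding (the 64-byte line size and the struct are assumptions for illustration):

    // Align hot objects to the 64-byte cache-line size so that no
    // instance ever straddles two lines; alignas also pads sizeof
    // out to a multiple of the alignment.
    struct alignas(64) Particle {
        float pos[3];
        float vel[3];  // 24 bytes of payload, 40 bytes of padding
    };
    static_assert(sizeof(Particle) == 64, "exactly one cache line each");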

Also, if you really want performance, go parallel. Cilk is seriously awesome, and there are forks of both gcc and clang that support it.
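
A taste of what Cilk code looks like, assuming one of those Cilk Plus forks (built with -fcilkplus):

    #include <cilk/cilk.h>

    // cilk_for lets the runtime work-steal loop iterations across
    // cores; the loop body is ordinary C/C++.
    void scale(float* a, int n, float k) {
        cilk_for (int i = 0; i < n; ++i)
            a[i] *= k;
    }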

~~~
deletes
Of course that applies less to games these days than to high-performance computing. Gaming systems vary too much for that kind of tuning to matter. Maybe if you are targeting consoles, but then you have to optimize for each one individually.

------
natebrennand
There are some interesting implications of CPU caching for databases, in terms of column stores vs. row stores. Besides the I/O improvements, column stores benefit greatly from better CPU usage: the data is iterated over by type (column by column), so the same operators stay in the cache. This allows data to be pipelined into the CPU and processed very quickly relative to traditional row stores. It can even be efficient to operate on compressed data [1] if the relevant decoding data can be kept in the cache.
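
A toy illustration of why the column layout scans faster (the schema and field sizes are invented for the example):

    #include <cstdint>
    #include <vector>

    // Row store: one struct per row. Summing one column strides over
    // unrelated fields, wasting most of every 64-byte cache line.
    struct Row { int32_t id; int64_t price; char name[48]; };  // 64 bytes

    int64_t sum_rows(const std::vector<Row>& rows) {
        int64_t total = 0;
        for (const Row& r : rows) total += r.price;  // uses 8 B of each 64 B line
        return total;
    }

    // Column store: the column is contiguous, so every byte fetched is
    // useful and the same tight loop stays hot.
    int64_t sum_column(const std::vector<int64_t>& price) {
        int64_t total = 0;
        for (int64_t p : price) total += p;
        return total;
    }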

In case you missed it, there was also a relevant article last week about how B-trees benefit from reducing the number of cache requests [2].

[1] [http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf](http://db.lcs.mit.edu/projects/cstore/abadisigmod06.pdf)

[2] [http://www.me.net.nz/blog/btrees-are-the-new-black/](http://www.me.net.nz/blog/btrees-are-the-new-black/)

~~~
zurn
Yes, array-of-structs -> struct-of-arrays is a well-known manual optimization pattern. Compilers sadly don't do it, except to cheat on SPECint.

------
deletes
I was curious about this, as I have never done any such testing, so I wrote a simple C program.

The object was a struct the size of a typical game object (100 B). I first trashed the memory by doing a lot of malloc/free calls of the object's size. The two tests were an array of objects (one contiguous array) and an array of pointers to objects that were each allocated separately. Then I iterated over the objects, performing a very simple operation (just enough to access each one, and identical for every object), and timed it. And of course I trashed the memory again before each run.
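
A sketch of that setup (I've left out the heap-trashing step, and the sizes and per-object operation are guesses at the commenter's choices):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    struct Object { char payload[96]; int value; };  // ~100 B, like a game object

    int main() {
        const int n = 1 << 20;

        // Contiguous array of objects: sequential, prefetcher-friendly.
        std::vector<Object> dense(n);

        // Array of pointers to separately allocated objects: every
        // dereference can land anywhere in the heap.
        std::vector<Object*> sparse(n);
        for (int i = 0; i < n; ++i) sparse[i] = new Object();

        auto time = [](auto&& body) {
            auto t0 = std::chrono::steady_clock::now();
            body();
            return std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
        };

        long sum = 0;
        double td = time([&] { for (auto& o : dense)  sum += o.value; });
        double ts = time([&] { for (auto* o : sparse) sum += o->value; });
        std::printf("dense %.4fs  sparse %.4fs  ratio %.2f (sum=%ld)\n",
                    td, ts, ts / td, sum);
    }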

The time-taken ratio was:

  array of objects : array of pointers to objects = 1 : 1.189

The second test was identical, except that I made an extra effort to ensure that no object behind the pointer array was adjacent in memory to the previous one:

  array of objects : array of pointers to objects = 1 : 2.483

The difference got much smaller once the operation on each object became more time-consuming.

------
baruch
There is also the topic of cache-oblivious data structures. Sadly, I'm still trying to figure that one out, but it looks like it could help both with the memory-vs-cache access-time disparity and, it is claimed, with the memory-vs-disk disparity as well.
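
The one piece I have managed to digest is the standard toy example, recursive matrix transpose; a sketch (the base-case size is arbitrary):

    // B = transpose(A); both n x n, row-major. The recursion keeps
    // subdividing until a block fits in *whatever* cache exists, so no
    // cache-size constant appears anywhere -- that is the "oblivious" part.
    // Call as: transpose_rec(A, B, n, 0, n, 0, n);
    void transpose_rec(const float* A, float* B, int n,
                       int r0, int r1, int c0, int c1) {
        if (r1 - r0 <= 16 && c1 - c0 <= 16) {  // small base case
            for (int r = r0; r < r1; ++r)
                for (int c = c0; c < c1; ++c)
                    B[c * n + r] = A[r * n + c];
        } else if (r1 - r0 >= c1 - c0) {       // split the longer dimension
            int rm = (r0 + r1) / 2;
            transpose_rec(A, B, n, r0, rm, c0, c1);
            transpose_rec(A, B, n, rm, r1, c0, c1);
        } else {
            int cm = (c0 + c1) / 2;
            transpose_rec(A, B, n, r0, r1, c0, cm);
            transpose_rec(A, B, n, r0, r1, cm, c1);
        }
    }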

If someone knows of an easier-to-digest explanation of the actual data structures than the proof-laden academic articles, I'd appreciate a pointer :-)

------
agumonkey
[http://channel9.msdn.com/Events/Build/2014/2-661](http://channel9.msdn.com/Events/Build/2014/2-661)

A talk by Herb Sutter about caches, the RAM prefetcher, and locality/contiguous allocation. Very simple and informative.

------
frozenport
The author should mention growing cache sizes.

~~~
andrewcooke
can you give a hint? i think i understand the article (well, i haven't read
it, but i sometimes need to tailor code to work better with the cache on a
cpu), but i don't know what you're referring to.

~~~
frozenport
The memory-speed vs. CPU-speed chart doesn't take caching technologies into account.

[http://gameprogrammingpatterns.com/images/data-locality-chart.png](http://gameprogrammingpatterns.com/images/data-locality-chart.png)

