
Gallery of Processor Cache Effects - KirinDave
http://igoro.com/archive/gallery-of-processor-cache-effects/
======
drmpeg
When I worked at LSI Logic, we had a dual-core processor that didn't have any
cache coherency at all between cores. You actually had to pad all memory
allocations out to a cache line boundary so that the two cores didn't
interfere with each other.
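
Roughly what that padding looked like, as a sketch. (The 32-byte line size and the allocator name here are my own assumptions for illustration; the real values would come from the chip's manual.)

```c
#include <stdlib.h>

/* Hypothetical line size for illustration; the real chip's line size
 * would come from its datasheet. */
#define CACHE_LINE 32

/* Round a request up to a whole number of cache lines so that no two
 * allocations ever share a line. With no hardware coherency, a core
 * holding a stale copy of a shared line would silently corrupt the
 * other core's data on write-back. */
static size_t pad_to_line(size_t n) {
    return (n + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
}

/* Line-aligned, line-padded allocation (C11 aligned_alloc requires the
 * size to be a multiple of the alignment, which pad_to_line guarantees). */
static void *line_malloc(size_t n) {
    return aligned_alloc(CACHE_LINE, pad_to_line(n));
}
```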

Unfortunately, we also let our customers develop their own code. One of our
best customers (Motorola) started having problems with one of their projects.
Small changes in their code would cause all kinds of bizarre crashes.

I was asked to help out, and after reviewing their code I realized that nobody
had told them about the cache coherency issue. After making a few changes
(just moving .bss and .data variables into cache-line-protected mallocs), all
the problems disappeared.

The funny part was when their engineering manager found out. He calls me on
the phone and starts to chew me out. He was furious, so I put him on speaker
and called my managers into the lab. We stood around and just let him rant for
a few minutes until he ran out of steam. A legendary episode that we chuckled
about many times thereafter.

~~~
monocasa
The PortalPlayer 5020 in my iPod mini was designed similarly. Two ARM7TDMIs
with non-coherent caches. I tried to hack up iPodLinux to support both cores
in a general way, but never really got it working well. The official iPod
software used one core for the OS and dedicated the other to codecs, IIRC.

~~~
drmpeg
Same situation on the LSI Logic (actually, it came from C-Cube Microsystems)
processor. One SPARC core running VxWorks and applications and the other SPARC
core running bare metal and MPEG-2 codecs.

The first version of the chip was introduced at CES 2001. Pretty early for a
multi-core CPU. Only 150 MHz clock rate though.

------
dang
2014:
[https://news.ycombinator.com/item?id=8778990](https://news.ycombinator.com/item?id=8778990)

2010:
[https://news.ycombinator.com/item?id=1094797](https://news.ycombinator.com/item?id=1094797)

(Those are for people who like to look at old discussions. Reposts are ok
after about a year on HN:
[https://news.ycombinator.com/newsfaq.html](https://news.ycombinator.com/newsfaq.html).)

------
utopcell
Nicely written post that summarizes the major pain points wrt cache
hierarchies.

The canonical reference for diving deeper is "What Every Programmer Should
Know About Memory" [1].

[1]
[https://people.freebsd.org/~lstewart/articles/cpumemory.pdf](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)

------
Symmetry
I was sort of surprised in example 4 that the compiler couldn't optimize
{a[0]++; a[0]++;} to {a[0]+=2;}, given that the allocation was right there in
the same function as the usage.

EDIT: Should have kept reading, other people apparently had this same comment
and it was addressed. This was for C# rather than C++ so no wonder my
intuition was off.

~~~
rocky1138
It was addressed that it didn't optimize it this way, but no one has clarified
why it didn't. I'm curious!

~~~
ygra
A just-in-time compiler has to balance compilation time (which slows your
program down, or makes it slower to start) against optimization quality. So a
lot of optimizations that are affordable in C++, where AOT compilation gives
you plenty of time, are not feasible for a JIT.

Things could be different by now, though, with RyuJIT and tiered compilation.
That article is quite old (although processors still work the same, just with
larger caches).

~~~
monocasa
Constant propagation is pretty cheap though and is a common JIT compiler
technique.

~~~
Symmetry
Constant propagation yes. But knowing that that memory location wasn't
volatile, say, is a lot harder and I don't blame a JIT for not being sure.
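
The same distinction exists in C, which makes for a cheap sketch of what the JIT has to prove before merging. (This is an analogy, not the actual JIT logic; `volatile` here stands in for whatever observability guarantee the runtime must preserve.)

```c
/* Plain int: nothing can legally observe the value between the two
 * increments, so the compiler is free to fold them into a single
 * "add 2". */
static int bump_twice(int *a) {
    a[0]++;
    a[0]++;
    return a[0];
}

/* volatile int: each ++ must be emitted as its own load and store,
 * because the intermediate value counts as observable. A JIT that
 * can't cheaply prove a location is non-volatile has to be this
 * conservative everywhere. */
static int bump_twice_volatile(volatile int *a) {
    a[0]++;
    a[0]++;
    return a[0];
}
```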

~~~
monocasa
C# very easily knows if a variable like that is volatile.

------
mrob
Mike Acton's CppCon 2014 talk on this subject is still the best I've seen:

[https://www.youtube.com/watch?v=rX0ItVEVjHc](https://www.youtube.com/watch?v=rX0ItVEVjHc)

~~~
faragon
A collection of Acton's videos and documents on Data-Oriented Design (plus
other authors on the same topic):
[https://github.com/dbartolini/data-oriented-design](https://github.com/dbartolini/data-oriented-design)

------
SilasX
This is a really interesting explanation of why some code might have a longer
or shorter runtime than you'd expect. And, as a bonus, it's _not_ titled "What
every programmer should know about ..."

[https://hn.algolia.com/?query=what%20every%20programmer%20sh...](https://hn.algolia.com/?query=what%20every%20programmer%20should%20know%20about&sort=byPopularity&prefix&page=0&dateRange=all&type=story)

------
fulafel
For those of us wishing to play along at home, what's the easy way to get a
disassembly of a single function with C# (Mono or DotNetCore)? The last one
would be fun to figure out.

------
jl6
This was from 9 years ago. Did anybody find an explanation of the last
example?

------
edoo
I always found it slightly frightening when an app depends on cache
characteristics for max performance. The app pretty much has to have the
machine to itself before you can count on that behavior.

I've always found it annoying that the cache offers drastic speedups for
doing things 'wrong', in the sense that an algorithm can beat the 'correct'
one while it fits in cache, right up until it scales out of the cache.

~~~
vvanders
It's only 'wrong' in the sense that physics limits the speed of DRAM memory
access. If you consider engineering to be the art of designing systems within
constraints then you definitely should be paying attention to cache coherency
and linear access patterns.

Feel free to ignore 50x performance/battery gains at your own peril.

~~~
edoo
By 'wrong' I mean you have algorithm A, which is logically less work than
algorithm B, but B runs much faster while the working set is in cache; the
moment your workload doesn't fit, B runs much slower than A. If you never
cross that line, great; otherwise you have issues.

~~~
vvanders
> logically less work

Is it? If you're excluding memory access then I would argue it's not a proper
representation of the work performed. You can have an algorithm that's
mathematically ideal but from an engineering perspective is the wrong choice.

Also, your workload doesn't always determine whether you fit in or out of
cache. Linear-access algorithms will scale on any architecture/cache size, as
the prefetcher will step in and hide your DRAM/network access time. It's
essentially like an infinite cache.
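
A minimal sketch of the difference (the matrix size here is arbitrary, and I'm assuming C's row-major layout):

```c
#define N 1024

/* Row-major traversal: consecutive addresses, so the hardware
 * prefetcher can stream cache lines ahead of the loop. */
static long sum_rows(const int *m) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i * N + j];
    return s;
}

/* Column-major traversal: exactly the same arithmetic work, but each
 * access jumps N*sizeof(int) bytes, touching a fresh cache line almost
 * every iteration once N*sizeof(int) exceeds the line size. */
static long sum_cols(const int *m) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i * N + j];
    return s;
}
```

Both return the same answer; only the order of memory traffic differs, which is the whole point.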

Realtime Collision Detection[1] (which is basically datastructures for 3D
space) does a fantastic job of picking algorithms that are both correct and
cache friendly. Data Oriented Design, SoA/AoS and the like are all techniques
that I think any Software Engineer worth their salt should be familiar with.

[1]
[http://realtimecollisiondetection.net/books/rtcd/](http://realtimecollisiondetection.net/books/rtcd/)
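
A quick sketch of the AoS vs. SoA distinction in C (the field names and particle count are made up for illustration):

```c
#define COUNT 4

/* Array of Structures: position and velocity interleaved. A loop that
 * only reads positions still drags all the velocity bytes through the
 * cache, wasting most of each line. */
struct particle { float px, py, pz, vx, vy, vz; };

/* Structure of Arrays: each field is contiguous, so a position-only
 * loop touches only position cache lines (and vectorizes more easily). */
struct particles {
    float px[COUNT], py[COUNT], pz[COUNT];
    float vx[COUNT], vy[COUNT], vz[COUNT];
};

static float sum_px_aos(const struct particle *p, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += p[i].px;
    return s;
}

static float sum_px_soa(const struct particles *p, int n) {
    float s = 0;
    for (int i = 0; i < n; i++) s += p->px[i];
    return s;
}
```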

~~~
LifeLiverTransp
There is a whole school of programming by now around not closing one's eyes
to the actual hardware engineering - of course fought against by the church
of Moore's law, who hoped to finally abstract away all those details into
virtual machines. And they would have gotten away with it too, if it weren't
for those plateauing speed gains.

[https://en.wikipedia.org/wiki/Data-oriented_design](https://en.wikipedia.org/wiki/Data-oriented_design)

