
Cold, Hard Cache – Insomniac Games’ Cache Simulator [slides] - unwind
https://deplinenoise.files.wordpress.com/2017/03/cachesimstyled.pdf
======
unwind
I thought this was impressive, since it's basically single-stepping the code
and disassembling each instruction on the fly to figure out the memory
accesses it generates, in order to update a simulated cache and compute any
misses.

It then integrates with the IDE to give you mouse-over reporting of a single
statement's cache hit performance.
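
For anyone curious what the "update a simulated cache" half looks like, here is a minimal sketch: a set-associative LRU model fed one address per simulated access. The geometry (32 KB, 8-way, 64-byte lines) is an illustrative assumption, not necessarily what ig-cachesim models.

```cpp
#include <cstdint>
#include <vector>

// Toy set-associative LRU cache model, purely for illustration.
struct CacheSim {
    static constexpr uint64_t kLineBytes = 64;
    static constexpr uint64_t kWays = 8;
    static constexpr uint64_t kSets = 32 * 1024 / (kLineBytes * kWays); // 64 sets

    std::vector<std::vector<uint64_t>> sets; // sets[s][0] is the MRU line tag
    uint64_t hits = 0, misses = 0;

    CacheSim() : sets(kSets) {}

    // Feed one memory access -- the address a single-stepper would extract
    // from each disassembled instruction.
    void Access(uint64_t addr) {
        uint64_t tag = addr / kLineBytes;
        std::vector<uint64_t>& set = sets[tag % kSets];
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == tag) {               // hit: move line to the MRU slot
                set.erase(it);
                set.insert(set.begin(), tag);
                ++hits;
                return;
            }
        }
        ++misses;                           // miss: install line, evict the LRU
        set.insert(set.begin(), tag);
        if (set.size() > kWays) set.pop_back();
    }
};
```

The clever part of the tool is the other half: getting a faithful address stream out of running x64 code in the first place.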

Also, the submission has been edited. I had mentioned that the system is open
source: [http://github.com/insomniacgames/ig-cachesim](http://github.com/insomniacgames/ig-cachesim)
was in the text I submitted, but it's also on page 105/105 of the slide PDF.

Edit, disclosure: I once had the pleasure of working with deplinenoise, and
tried to sponge up as much knowledge as possible. He's great.

------
CorvusCrypto
It is amazing how quickly he was able to put this tool together. Two weeks to
build a very specialized tool AND perform the optimizations on the code. What
am I doing with my life?

~~~
meredydd
On one hand: Yes, this is an extremely cool hack. Hats off.

On the other hand: It probably seems particularly impressive because its
building blocks don't happen to be in your skill set. A rule of thumb:
Something is impressive in proportion to how many times you think "How did
they do _that_?!".

In this case, if you're not already familiar with disassembly, binary
instrumentation, how caches work, etc, then every step along this way sounds
magical. But to a low-level/systems programmer, this sort of thing is their
bread and butter. So their reaction wouldn't be "OMG how did he even...?" -
it's probably more like "Niiiiiiice".

(If you are, eg, a web developer, you probably have an equivalently ridiculous
level of background knowledge about the internals of 5+ levels of the web
stack, and how each goes wrong. I have seen systems guys - people who would
jump on this project in a heartbeat - take one look at modern web dev and flee
with their head in their hands, asking "how do they _know_ all that stuff?!")

~~~
bronxbomber92
Heh, not the best example to make your point [in this context], because the
presenter also manages a group that writes/maintains the company's web tools,
which consist of 340,000 lines of Javascript and 500,000 lines of C++ server-
side code. :)

[https://deplinenoise.files.wordpress.com/2017/03/webtoolspostmortem.pdf](https://deplinenoise.files.wordpress.com/2017/03/webtoolspostmortem.pdf)

My experience has been that people who are good at the bottom layers of the
stack are good with the top layers of the stack, too. And if they're
unfamiliar, they can ramp up very quickly.

~~~
meredydd
[Update - Holy Moses, that presentation is amazing! It's a perfect
illustration of what happens when you _don't_ have the background of
experience that web devs do, and just throw a bunch of smart but inexperienced
C++ devs at a huge Javascript project. They walked into every rake in that
grass, and I get a distinct subtext of "holy crap, I would never have believed
the tooling was _this bad_..."]

My comment addresses the perception that "OMG programmers today know nothing;
this is what a Real Man looks like". Someone learning web development has
spent a lot of brain-space on learning a level of in-depth knowledge that
_also_ looks magical to someone who hasn't.

As it happens, I too think that the things you learn lower down the stack tend
to make you a better engineer, whereas what you learn higher up the stack is
too often an unedifying schlep. High-level systems spend a lot of complexity
solving problems created by the layer underneath. (This goes 10x for web
frameworks.) Low-level systems are more tightly constrained by the boundaries
of the possible, so they spend their complexity on more fundamental problems.
The same amount of time and intelligence spent learning the ins and outs of
Angular yields fewer transferable skills than learning how compilers work. But
if you need to build a web startup, compiler expertise on its own won't help
you.

It's like learning physics vs biology. Sure, physics is more fundamental, and
a physicist learning biology usually has an easier time than the other way
round. But a research biologist has _also_ spent their time acquiring an
immense amount of expertise, and fundamentally we need medical advances more
directly than we need confirmation of the Higgs boson.

------
Cyph0n
I'm taking a computer architecture grad course right now, and coincidentally
our first project involved writing a cache simulator. Obviously, the result
was much more coarse-grained than this work.

For our project, we only needed to simulate cache accesses, not the cached
contents. We also kept track of the LRU block, simulated sub-blocking, and
added a FIFO victim cache to the simulation. It was a fun exercise, and I got
to learn a bit more C++ in the process.
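
For readers unfamiliar with victim caches: the core idea is a tiny fully-associative buffer that catches lines evicted from a direct-mapped cache, so conflict misses between two hot lines become cheap swaps. A rough sketch (all sizes are made up, not the actual assignment's parameters):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Direct-mapped L1 backed by a small FIFO victim cache.
struct VictimCacheSim {
    static constexpr uint64_t kLineBytes = 64;
    static constexpr uint64_t kL1Sets = 256;     // 16 KB direct-mapped L1
    static constexpr std::size_t kVictims = 8;   // 8-entry FIFO victim cache

    std::vector<uint64_t> l1;      // one tag per set; 0 means "invalid"
    std::deque<uint64_t> victims;  // FIFO of tags evicted from L1
    uint64_t l1_hits = 0, victim_hits = 0, misses = 0;

    VictimCacheSim() : l1(kL1Sets, 0) {}

    void Access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        uint64_t tag = line + 1;               // +1 keeps 0 free as "invalid"
        uint64_t& slot = l1[line % kL1Sets];
        if (slot == tag) { ++l1_hits; return; }

        auto it = std::find(victims.begin(), victims.end(), tag);
        if (it != victims.end()) {             // caught by the victim cache
            ++victim_hits;
            victims.erase(it);
        } else {
            ++misses;                          // true miss: goes to memory
        }
        if (slot != 0) {                       // displaced L1 line enters the FIFO
            victims.push_back(slot);
            if (victims.size() > kVictims) victims.pop_front();
        }
        slot = tag;
    }
};
```

Two addresses that conflict in the direct-mapped L1 (same set, different tags) will ping-pong between L1 and the victim buffer after their first two misses, instead of missing all the way to memory every time.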

The next project involves building a CPU pipeline simulator, which _might_ be
more challenging.

------
keldaris
Having glanced at the code, it seems fairly easy to extend to modern Intel
CPUs, which is great. The biggest caveat is the lack of prefetcher simulation:
a prefetcher makes many seemingly horrible cache misses a non-issue in
practice, but simulating one is a non-trivial extension.

Regardless, very nice to see tools like this being open sourced. The games
industry has a lot of valuable experience with performance-critical code and
I'd love to see more of it shared publicly.

~~~
daemin
If you use this tool to find the outliers which just break the cache badly,
then you can apply some human introspection on the data and figure out what is
a problem and what is not.

Trying to add in a prefetch handler would be painful and time-consuming, and
may actually hide some cache misses that you would want to find. Also, the
prefetch algorithm used can change between CPU families, and potentially even
within a family.

~~~
keldaris
Very true, hence "caveat" rather than "problem" or "dealbreaker". It's
something you have to constantly keep in mind as you look at the results: some
very fast array iterations will habitually look terrible, and given the
features of modern prefetchers, much less obvious patterns will also look
worse than they are every now and then.
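
To make the "fast iterations look terrible" point concrete, here's a toy model of the effect: a hypothetical next-line prefetcher makes a streaming scan nearly miss-free, so a simulator without one over-reports misses on exactly that kind of code. Real prefetchers (strided, multi-stream, tuned per CPU family) are far more complex, which is why simulating them faithfully is painful.

```cpp
#include <cstdint>
#include <unordered_set>

// Line-granularity residence model, unbounded for simplicity.
struct PrefetchDemo {
    static constexpr uint64_t kLineBytes = 64;
    std::unordered_set<uint64_t> present;  // which cache lines are resident
    bool prefetch;
    uint64_t demand_misses = 0;

    explicit PrefetchDemo(bool enable) : prefetch(enable) {}

    void Access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        if (present.insert(line).second) ++demand_misses; // wasn't resident yet
        if (prefetch) present.insert(line + 1);           // pull next line early
    }
};
```

Scanning a buffer byte-by-byte, the no-prefetch model reports one miss per 64-byte line, while the next-line model reports a single cold miss for the whole scan. The truth on real hardware sits somewhere in between, which is exactly the judgment call you have to make when reading the simulator's output.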

------
zellyn
Coincidentally, I just watched this last night. It's a really accessible
refresher if you're a bit rusty on how all these caches work and interact:
[https://youtu.be/OFgxAFdxYAQ](https://youtu.be/OFgxAFdxYAQ)

~~~
zengid
Great talk. I've been hoping to find something like this, because CPUs seem
like a dark art these days. Also, not related, but Cliff Click's accent is
delightful.

------
vvanders
Very cool stuff.

Fun fact: a certain game-dev kit had another machine sitting alongside on the
bus. It let you sample the Program Counter at N=1 for a small number of
samples (I want to say ~4k) off of a breakpoint.

The amount of perf stuff you could trivially track down was staggering. Load-
Store-Fetch? Boom, there it is. Cache misses? Clear as day.

Never worked on another platform with such an incredible profiler and still
miss it to this day. You could set N to any power of 2 for coarser profiles
but it always shined at N=1.

~~~
ndh2
Any particular reason you would refrain from naming it?

------
CoolGuySteve
A small note: callgrind doesn't necessarily have to be all-or-nothing.

You can use the CALLGRIND_START_INSTRUMENTATION and
CALLGRIND_STOP_INSTRUMENTATION macros to start and stop the stats collection.
Start valgrind with the --instr-atstart=no option to disable collection for
everything outside of those blocks.

It's still really, really slow, but much faster than when it collects
everything.
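
A sketch of what that scoping looks like in code. The fallback defines are my addition so the snippet builds even without the valgrind headers installed; under valgrind the real client-request macros from `<valgrind/callgrind.h>` are picked up instead (they're no-ops when the program isn't running under valgrind anyway).

```cpp
#if defined(__has_include) && __has_include(<valgrind/callgrind.h>)
#include <valgrind/callgrind.h>
#else
#define CALLGRIND_START_INSTRUMENTATION
#define CALLGRIND_STOP_INSTRUMENTATION
#endif

// Bracket just the region you care about.
long long SumInstrumented(long long n) {
    CALLGRIND_START_INSTRUMENTATION;   // stats collection begins here
    long long sum = 0;
    for (long long i = 0; i < n; ++i) sum += i;
    CALLGRIND_STOP_INSTRUMENTATION;    // and stops here
    return sum;
}
```

Then launch with `valgrind --tool=callgrind --instr-atstart=no ./prog`, and everything outside the bracketed region runs uninstrumented.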

------
canteloops
Isn't this pretty dependent upon the specific cache architecture? Definitely
nowhere near an expert here, but I thought that CPUs all had varying sizes and
layers of caches, on top of additional specialized structures - e.g., victim
caches. How much portability does this one-size-fits-all tool have, or is the
variation between CPUs not an issue?

~~~
pm215
Yeah, you need to tell it what cache configuration you want to simulate
(there's an example of doing this for the Jaguar cache on slides 84/85).
Valgrind's cachegrind tool also lets you tweak the cache config it simulates.

Some bugs are going to cause problems regardless of the cache config (like the
"access a cold page unnecessarily every iteration of the loop" examples); some
might be more sensitive to exact cache configuration.
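
In practice "configuring the cache" boils down to a handful of geometry parameters and the address split they imply. The numbers below are the widely published AMD Jaguar L1D figures (32 KB, 8-way, 64-byte lines); treat them as an example, not as what the tool hardcodes.

```cpp
#include <cstdint>

// A cache level's geometry and the derived address decomposition.
struct CacheConfig {
    uint64_t size_bytes;
    uint64_t ways;
    uint64_t line_bytes;

    // Number of sets = capacity / (associativity * line size).
    uint64_t sets() const { return size_bytes / (ways * line_bytes); }
    // Which set an address falls into, and the tag stored for it.
    uint64_t set_of(uint64_t addr) const { return (addr / line_bytes) % sets(); }
    uint64_t tag_of(uint64_t addr) const { return addr / line_bytes / sets(); }
};
```

Swapping the three numbers is enough to retarget the simulator between, say, a console CPU and a desktop part, which is why the config-dependent bugs pm215 mentions are worth re-running with each target's geometry.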

------
csense
I wish the slides went more into detail about what is meant by a "Jaguar
Cache." It immediately made me think of [1] but clearly no one is writing
games for it in 2017.

[1]
[https://en.wikipedia.org/wiki/Atari_Jaguar](https://en.wikipedia.org/wiki/Atari_Jaguar)

~~~
keldaris
They're referring to AMD Jaguar cores as used in the PS4.

[https://en.wikipedia.org/wiki/Jaguar_%28microarchitecture%29](https://en.wikipedia.org/wiki/Jaguar_%28microarchitecture%29)

------
nfriedly
BTW, the "n" key jumps to the next slide without scrolling. (In Firefox's
built-in PDF reader, anyways - I found that by accident when trying to make
the space bar behave the way I wanted ;)

~~~
Narishma
j/k, n/p and left/right arrow keys all do the same.

------
drudru11
Awesome - I have been waiting for something like this for years.

I am sure the next steps will be better visualization and simulator accuracy.

------
Aissen
Interesting. Sounds like they needed perf. Yes, it's sampling, but its
sampling is extremely fast. Porting to Linux has its advantages.

~~~
unwind
I don't think this can be called "sampling"; it's single-stepping and
analysing each instruction.

There's nothing statistical about it, no "samples" hoped to be representative
of some hidden full system, since the entire instruction stream is analyzed
while the simulator is enabled.

~~~
Aissen
I understand the advantages of their tool; it's interesting to get
instruction-level detail. It's just that the sampling systems they compared
against are far from having perf's granularity.

------
btczeus
Direct, explicit cache control? NO! Tedious workarounds for an ancient
architecture FTW

