
On lists, cache, algorithms, and microarchitecture - ingve
https://pdziepak.github.io/2019/05/02/on-lists-cache-algorithms-and-microarchitecture/
======
hyperman1
I wonder if it is worth it to create a hybrid data structure: a linked list of
arrays. Say you tune it so each array fits in the L1 cache. You don't need a
huge chunk of contiguous memory, but you still get most of the performance
benefits of arrays. Iteration gets a lot more complicated, of course.

~~~
htfy96
It's called an unrolled linked list
([https://en.m.wikipedia.org/wiki/Unrolled_linked_list](https://en.m.wikipedia.org/wiki/Unrolled_linked_list)).
Note that if you make sure each chunk contains sqrt(N) elements, then you can
achieve insertion/deletion/lookup in O(sqrt(N)) time.
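A minimal sketch of the idea in Python (the chunk size, class names, and
append-only interface here are illustrative choices, not from the article; a
CPython list only approximates the contiguous layout a real inline array would
give you in C):

```python
CHUNK = 4  # in practice you'd tune this toward the cache line / L1 size


class _Node:
    """One node: a small array of elements plus a next pointer."""

    def __init__(self):
        self.items = []   # holds at most CHUNK elements
        self.next = None


class UnrolledList:
    def __init__(self):
        self.head = self.tail = _Node()

    def append(self, value):
        if len(self.tail.items) == CHUNK:  # tail chunk full: link a new node
            self.tail.next = _Node()
            self.tail = self.tail.next
        self.tail.items.append(value)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield from node.items          # walk one chunk at a time
            node = node.next


ul = UnrolledList()
for i in range(10):
    ul.append(i)
print(list(ul))  # elements come back in insertion order
```

Iteration walks far fewer pointers than a plain linked list (one per CHUNK
elements instead of one per element), which is where the cache friendliness
comes from.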

------
spsx
I wonder if there are any good resources on hardware details for someone with
a background in applied math. Whenever I find an article talking about
cache/branch prediction, I find it hard to follow.

~~~
dragontamer
Your #1 tool is simply a stopwatch. Write programs, time them, and then learn.
See why some programs are faster or slower than others.

Your #2 tool is hardware performance counters. It's much easier to understand
the cache and branch prediction when you use the hardware to count how many
times the cache is hit and how often branches are taken.

[http://www.brendangregg.com/perf.html#Events](http://www.brendangregg.com/perf.html#Events)

Section 5 of that page shows how Linux "perf" accesses these hardware
performance counters for you.

Write a program and see how many cache misses it has. Write a slightly
different program, and the cache misses will be different. How did that change
your #1 measurement (time to complete the program)?

Now do that a thousand times, with a thousand different programs trying to
solve the same thing. Bam, now you're an expert.

It's not really hard. It's just a matter of experience. You'll get there if
you work at it, and use the right tools to "see" what is going on.
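The stopwatch experiment described above can be sketched like this (buffer
size and stride are arbitrary illustrative choices; exact timings vary by
machine, and CPython's interpreter overhead blurs the effect compared to doing
the same thing in C):

```python
import time

N = 1 << 22        # 4 MiB buffer, comfortably larger than typical L1/L2
STRIDE = 4097      # odd, hence coprime with N: every index is visited once
data = bytearray(N)


def timed(fn):
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Sequential walk: consecutive bytes share cache lines.
seq_sum, seq_t = timed(lambda: sum(data[i] for i in range(N)))

# Strided walk: the same N accesses, but each one lands ~4 KiB from the
# previous, so each access touches a different cache line (and page).
str_sum, str_t = timed(lambda: sum(data[(i * STRIDE) % N] for i in range(N)))

assert seq_sum == str_sum  # same bytes visited, only the order differs
print(f"sequential: {seq_t:.3f}s  strided: {str_t:.3f}s")
```

Running the two walks under `perf stat` as well would show the cache-miss
counters diverging even though both loops do the same amount of "work".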

\------

Why do people talk about cache hits and branch prediction? Because in _their_
experience, counting cache hits and looking at branch predictions results in
performance gains.

Their experience won't necessarily match yours, but you can still learn from
them in general.

~~~
Erwin
While "perf" is great (especially for something like optimizing a shared
library loaded into Python), I had a hard time using any counter other than
instruction retirement for anything meaningful.

"perf list" lists some 1,300 things, and it seems those counters come and go
between architectures.

Intel's tools are free now, and they give you a very nice graphical analysis.
You don't need to buy/use the Intel compiler, just have 1-2 GB of free disk
space.

Here's a page describing analysis if you want to go beyond knowing that some
section of your code takes 40% of CPU time:
[https://software.intel.com/en-us/vtune-amplifier-cookbook-top-down-microarchitecture-analysis-method](https://software.intel.com/en-us/vtune-amplifier-cookbook-top-down-microarchitecture-analysis-method)

~~~
dragontamer
That's true, but "perf" is more portable between platforms.

For example, I can't use VTune at all, because I run a Threadripper 1950X.
Personally speaking, I use AMD uProf instead.

Without knowing the guy's setup, "perf" is the most portable option. The
processor-specific tools will be more accurate, but also more complicated.
It's the Linux tool that tries to work across platforms: AMD, Intel, and even
ARM.

\---------

"Perf" will give you L1 / L2 cache hits, and accurate counts at that. The main
issue is branch prediction, which requires very complicated circuitry to
predict well. You need to use Intel's trace analyzer for accurate branch
prediction statistics.

For AMD, you need to use instruction-based sampling in uProf for good branch
prediction statistics.

L1 / L2 cache stuff is the beginner level though, so Perf should be a good
starting point in this whole mess.

------
benrbray
This is excellent, and opportune timing for me too! I just finished courses on
microarchitecture and compiler optimization.

Can anyone recommend any similar articles about low level optimization?

~~~
jokh
Michael Abrash's Black Book is a classic.

[http://www.jagregory.com/abrash-black-book/](http://www.jagregory.com/abrash-black-book/)

~~~
abainbridge
It is a classic book and a great reference if you are optimizing for the
80386, but almost totally irrelevant for modern hardware. It's probably still
a fun read for reminiscence purposes - Michael Abrash was a great writer.

~~~
yaaqs
It goes into Pentium architectural points as well, which is far more modern
than the 386. Built-in math coprocessors, glorious! Though your point remains.

I am guilty of having a fair bit of nostalgia for this book. Memories of
sitting in the back seat of my parents' car in middle school. Poring over this
book. Fascinated by these inner secrets of machines I knew so little about at
the time. Presented in a way even a 12-year-old could mostly understand.

