
Memory Bandwidth Napkin Math - Epholys
https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/
======
Sirupsen
I’ve been fascinated by the 'napkin math' topic recently, but felt a need for
a way to practise routinely. It's an acquired skill: the goal is for it to be
effortless to do order-of-magnitude calculations in a meeting, or to zip
through possible solutions on a whiteboard (what I imagine Jeff Dean does).
Will it be fast enough? How much will it cost? Does the benchmarked
performance match the order of magnitude we’d expect?

To practise, I created a newsletter last year with a monthly problem that may
be of interest to others wanting to sharpen their napkin math:
[https://sirupsen.com/napkin/](https://sirupsen.com/napkin/)

~~~
gumby
The two big things are a way to conceptualize your problem in a
straightforward way (hard) and to learn to rapidly do adequate (not precise)
arithmetic in your head (easy).

Wait, the math is easy? Sure, if all you are concerned with is a roughly
right answer, not a precisely correct one. Back when people used slide rules
this was common, but now when you do it it seems to weird some people out.
You should develop this skill anyway, because when you see an answer you
should be able to tell at a glance if it's probably right or almost certainly
wrong.

Why it's simple: the first part is just to keep track of the order of
magnitude.* People who used slide rules always had to do this, and it's
quick to pick up and pretty easy to do once you're used to it.

Second is just to know a few common fractions and be comfortable rounding
intermediate results to convenient amounts (if you have "86" you might round
it to 81 if you're dividing into thirds or ninths, or 88 if it's by 11, or 80
or 90 if you care about 10x and would prefer your error to be "too small" or
"too big").

Third is to understand those error bars above, and, as you do when you work by
floating point, avoid dealing with incommensurate numbers (this factor is so
tiny I'll just ignore it).

When you get good at this you'll usually be within a few percent of the actual
answer, which is usually enough to decide if it's worth actually calculating
the answer or not.
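The rounding trick above can be checked in a couple of lines; 86 divided into
ninths is the parent's own example:

```python
# Napkin-style division: round to a convenient nearby number first.
# 86 / 9 is awkward in your head; 81 / 9 is instant, and the error is small.
exact = 86 / 9           # ~9.56
napkin = 81 / 9          # 9, done mentally
error = abs(napkin - exact) / exact
print(f"exact={exact:.2f}  napkin={napkin:.0f}  error={error:.1%}")
```

Rounding 86 down to 81 also biases the result "too small", which is exactly
the kind of directional error control described above.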

Example: I remember a discussion a few years ago where we were trying to
figure out if we could fit our product into a certain volume. As we discussed
the parameters, I and a colleague simultaneously said "360 micrograms"
(density was 1, so g = ml). The calculator wielder beavered on and a few
seconds later triumphantly said "357 mg". Sure, 357 was more precise than
360, but it was clear he was madly off in magnitude. He wanted to believe his
calculator, and checked his work while the rest of us moved on.

* Unless you're a physicist in which case within a few orders of magnitude is probably OK, or a cosmologist in which case all you care about is 10^0

------
russellbeattie
150ms round trip California to the Netherlands. Oof. That's like 9 frames of a
60fps video. The speed of light sucks! Someone should do something about that!

~~~
lightsighter
SF to Amsterdam is 5448 miles, which is 0.29 lightseconds, or 29ms at the
speed of light, or about 2 frames of 60fps video, so 5X faster. :)

~~~
pkroll
I'm shocked no one has pointed out .29 lightseconds is 290ms, not 29ms, and
that you meant 0.029 lightseconds.

------
ComSubVie
That's a really great post!

Good problem description, good introduction, explained code, meaningful and
explained graphs, full source code available, conclusion and even possible
further exercises. I wish (all) scientific papers were that well written.

------
m0zg
Yep. Most programmers today don't realize that "R" in "random access memory"
can turn your RAM into a pumpkin very easily, and reduce bandwidth to below
what you can get reading sequentially _from a budget SSD_.

Another number to put things into perspective: at 4GHz, 60ns is _240_ cycles.
So every time you feel like you don't care about cache locality, try
laboriously counting to 240.
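The 240-cycle figure is just the memory latency times the clock rate:

```python
clock_hz = 4e9            # 4 GHz core clock
mem_latency_s = 60e-9     # ~60 ns main-memory access, the figure above
stall_cycles = mem_latency_s * clock_hz
print(round(stall_cycles))   # cycles the core spends waiting on one miss
```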

------
jandrese
This is a lot more than just Napkin Math. He writes a synthetic benchmark to
demonstrate the orders of magnitude difference in performance you can see
depending on how you approach a problem.

~~~
pixelpoet
... to get the numbers presented at the end for doing napkin math.

------
ColanR
It's funny, I had this conversation last week as the coworker, but didn't
actually know how to go about calculating this without running tests.

It does seem like there are classes of problems which are completely
bandwidth-limited. I've heard this is the next area of expansion for hardware
tech, but I haven't seen much yet.

~~~
gumby
> I've heard this is the next area of expansion for hardware tech, but I
> haven't seen much yet.

This has been a continuously active area of hardware design since the 1960s
(consider Cray's work on the 6600 and his later work at his own company). The
whole HPC world has to obsess over this issue.

~~~
ColanR
Fair enough. I was thinking compared to cpu clock speed and ram capacity,
which seemed to have been an area of greater focus compared to the bandwidth
between them.

------
hinkley
> Let this sink in. Random access into the cache has comparable performance to
> sequential access from RAM. The drop off from sub-L1 16 KB to L2-sized 256
> KB is 2x or less.

> I think this has profound implications.

I think I agree.

Can anyone here posit a theory _why_ this is true? Is this a consequence of
all the stream processing work in recent generations of processor? Or
something else?

Is he saying that pointer chasing even when the values are in cache is the
culprit?

~~~
staticfloat
I actually don't think this is that profound; when OP is testing datasets that
fit within cache, what's happening is that we are simply not waiting for data
to be loaded in from RAM. Let's look at this from a different angle; instead
of looking at GB/s, let's think about the CPU as a machine that executes
instructions as fast as it can, then look at what can go wrong.

I could write a program in assembly that is simply 1000000 triplets of (load,
add, store) instructions, each reading from a sequentially-increasing memory
location. We could think of it like a fully-unrolled addition loop. My CPU,
operating at 3GHz, supposedly should complete this program in ~1ms (3 million
instructions running at 3 billion instructions per second), but (spoiler
alert) it doesn't. Why?
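The ideal-case arithmetic in the paragraph above works out like this:

```python
# Best case: one instruction retired per cycle, no stalls of any kind.
triplets = 1_000_000
instructions = 3 * triplets      # one (load, add, store) per triplet
clock_hz = 3e9                   # 3 GHz
ideal_seconds = instructions / clock_hz
print(f"{ideal_seconds * 1e3:.1f} ms if no instruction ever waited")
```

The real runtime is far longer, and the rest of the comment explains where
that gap comes from.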

The answer is that the CPU executes instructions as quickly as it can, but it
spends an awful lot of time waiting around for various subsystems to finish
doing something. Unless you are running highly optimized code, chances are
your CPU, even though "utilization" is pegged at 100%, has more circuits
sitting around doing nothing than it has circuits doing something. One of the
largest contributors to your CPU impatiently tapping its foot is the memory
subsystem, because fetches from RAM are so slow, and the `load` instruction
(and all dependent instructions such as the `add` and `store` instructions) is
completely blocked upon that instruction finishing completely.

To help with this, we have the whole cache hierarchy, with prefetching of
whole cache lines and whatnot, to try and grab pieces of memory that we think
will be used next into cache, such that we can then access them with _much_
lower latency (and therefore higher bandwidth, since the only thing preventing
us from processing more data is waiting around for more data to come in, ergo
latency is directly the inverse of bandwidth in this case).

Therefore, when doing random accesses upon a dataset that won't fit into
cache, we expect the average time-per-instruction-retired to be roughly how
long it takes to pull it in from RAM. Sequential access is faster only because
when I ask for a value and it doesn't exist, I grab not only that whole value,
but the entire cache line at once, such that the next couple of loads only
have to reach out to cache. Smart compilers can place prefetching hints in
here as well, so that future loads are overlapping our accesses to cache.

The reason I find random access into cache having the same performance as
sequential access as not that profound is because it falls out directly from
the above scenario: sequential access into RAM _is_ random access of cache!
The reason sequential access to RAM is fast is because the values are in cache
due to having fetched an entire cache line; therefore randomly accessing those
same cache buckets in a random order is equivalent (from this perspective).

For those that are interested in learning more about the cache hierarchy and
why it's important to organize your algorithms/data structures such that you
can do a lot of work without your working set falling out of cache, I highly
recommend reading the HPC community's tutorials on writing high-performance
matrix multiplication. Writing a fast GEMM is one of the simplest exercises
that will drive home why we have tricks like tiling, loop unrolling, loop
fusion, parallelization, etc., to make maximal use of the hardware available
to us.

For those that like academic papers, Goto et al.'s paper has a lot of this
laid out nicely:
[https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf](https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf)
For those that like follow-along tutorials, this was a fun one for me:
[http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/](http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/)

~~~
forrestthewoods
Hi. OP here.

> The reason I find random access into cache having the same performance as
> sequential access as not that profound is because it falls out directly from
> the above scenario

Correct. The behavior can be logically explained. There's no magic involved.

> sequential access into RAM _is_ random access of cache

I actually like your statement even better than my post. Sequential access
into RAM _is_ random access of the cache! What a delightfully profound
statement.

High-performance computing is your specialty. Of course everything in my post
is obvious to you. Elementary even.

If you want to argue the semantics of whether a logical conclusion counts as
profound or not... well, I guess?

Not gonna lie. Your comment kind of comes across as a long-winded humblebrag.
"Everything OP said is true. I just think it's obvious." Sorry if that wasn't
your intent.

~~~
staticfloat
Sorry, I don't mean for this to come across as playing down your post; you're
right that "profoundness" is completely subjective and there's no use in me
saying "it's not that profound", for that I apologize.

I enjoyed your post quite a bit, especially how far you went in order to get
concrete results that back up the theory. Thank you for going through the
effort of writing this post and educating others on your findings. :)

~~~
forrestthewoods
Thanks, I appreciate that. <3

:)

------
thedance
You'll become a better programmer if you also understand why randomly
accessing main memory is slow and why randomly accessing caches is not slow.

~~~
gumby
...randomly accessing caches _need not be_ slow.

------
smallnamespace
Fast mental math is useful enough that folks in different fields keep
reinventing variants of it, with different terminology:

- Estimation ('market sizing') is a standard part of management consulting
case interviews [1]. The reason is that clients will be throwing you
questions, and one part of the job is to look smart and give reasonable
answers _on the fly_, without going to a computer or grabbing a calculator
first.

- Physicists call them Fermi problems [2].

- Microsoft (in)famously asked 'How many ping pong balls fit into a 747?' as
a brain teaser [3]. This was common enough that someone wrote a book about
these brain teasers [4].

- Fast mental math is a standard part of many trader interviews, since you'll
be making split-second decisions under pressure [5].

One technique is converting everything into log10 first, e.g. 3 billion is
about 3 * 10^9 ~ 10^9.5, then you're just adding / subtracting exponents to
multiply / divide. Another way is to always round inputs to 'easy' numbers (2,
3, 5), and calculate them separately from exponents.
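The log10 trick above, sketched out; the 3-billion figure is from the
comment, and the second operand is made up for illustration:

```python
import math

def napkin_log10(x):
    """Round log10(x) to one decimal -- the mental 'exponent' of x."""
    return round(math.log10(x), 1)

# Multiply 3 billion by 20,000 by adding exponents: 9.5 + 4.3 = 13.8.
a, b = 3e9, 2e4
exp = napkin_log10(a) + napkin_log10(b)
estimate = 10 ** exp
print(f"estimate = 10^{exp:.1f} = {estimate:.2g}, exact = {a * b:.2g}")
```

Rounding the exponents to one decimal keeps the estimate within a few percent
here, which is plenty for a go/no-go answer.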

A few minutes with a napkin can easily save several hours doing something that
can't possibly be worthwhile [6]

[1] [https://mconsultingprep.com/market-sizing-example/](https://mconsultingprep.com/market-sizing-example/)

[2]
[https://en.wikipedia.org/wiki/Fermi_problem](https://en.wikipedia.org/wiki/Fermi_problem)

[3] [https://www.inc.com/minda-zetlin/microsoft-changes-job-inter...](https://www.inc.com/minda-zetlin/microsoft-changes-job-interview-process-no-more-brain-teasers.html)

[4] [https://www.amazon.com/How-Would-Move-Mount-Fuji/dp/03167784...](https://www.amazon.com/How-Would-Move-Mount-Fuji/dp/0316778494)

[5] [https://www.quora.com/Why-do-hedge-prop-quant-funds-ask-ment...](https://www.quora.com/Why-do-hedge-prop-quant-funds-ask-mental-math-questions-during-the-interview)

[6] [https://xkcd.com/1205/](https://xkcd.com/1205/)

------
rckoepke
I'm having trouble finding information on the datatype used, 'matrix4x4_simd'.
Can anyone point me to some resources to learn about these?

~~~
inetknght
At the bottom of the blog post, the author links to a Github gist which
includes the C++ source code.

------
fulafel
Where does the 5 GB/s napkin estimate for RAM come from? It's lower than the
pointer-chasing figure of 7 GB/s.

~~~
Taniwha
Unlike just reading a bunch of random data, the read instructions can't be
pipelined: the instruction that uses the read pointer can't be dispatched to
the load-store unit until after its address has arrived in the CPU. (Two
reads where you know the addresses can just be queued, and can even finish
out of order if the second one hits in a closer cache than the first one.)
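Python can't expose the hardware timing difference, but the dependency
structure being described can be sketched; `perm` and the function names here
are made up for illustration:

```python
import random

N = 1 << 16
perm = list(range(N))
random.shuffle(perm)

def chase(steps):
    """Pointer chasing: the next address is the result of the current load,
    so the hardware cannot start load k+1 until load k has returned."""
    i = 0
    for _ in range(steps):
        i = perm[i]          # serialized: load-to-load dependent
    return i

def gather(indices):
    """Independent random reads: every address is known up front, so the
    load-store unit is free to queue and overlap them."""
    return sum(perm[i] for i in indices)   # no load-to-load dependency
```

`chase` has the shape of a pointer-chasing benchmark; `gather` corresponds to
random reads whose addresses are precomputed, which the hardware can overlap.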

~~~
fulafel
I think you are describing the pessimal pointer chasing case, that should be
the smaller figure.

~~~
Taniwha
More I'm trying to explain why pointer chasing is going to be slower than
random integer accesses

~~~
fulafel
Yes, it should be lower. Yet there is a lower figure estimated for a mixed
workload:

"A more realistic application might consume 5 GB/s at 144 Hz which is just 69
MB per frame."

But now I see what it's about: the 7 GB/s figure is for 6 threads; for 1
thread he gets 2 GB/s.

------
mooibos
Nice. It explains a few "WTF" performance issues I had in the past.

------
andikleen4
Sounds like he's trying to reinvent the roofline model.

A lot of work has been already done in this area.

[https://en.wikipedia.org/wiki/Roofline_model](https://en.wikipedia.org/wiki/Roofline_model)

~~~
BubRoss
He didn't try to 'reinvent a model'; he just showed where modern numbers fall
in his simple benchmarks.

------
whois
Oh snap, it's Forrest! I wonder what he's up to these days. He used to be a
dev at Uber Entertainment and did some cool community engagement.

