
What every programmer should know about memory, Part 1 - Anon84
http://lwn.net/Articles/250967/?rss=1
======
sho_hn
If you want to read something more approachable that can serve as a gateway to
this, I highly recommend Jon Stokes' _Inside the Machine_. It goes over
microprocessor architecture in a very approachable way, and contains a good
chapter on the memory/cache hierarchy that is sort of the light version of a
good chunk of Drepper's paper, or at least will be a big help in making sense
of it. If you've never been close to the metal, or you want to catch up on
newer developments, consider checking it out.

Also, here's Drepper's paper compiled into a single PDF, which I think was
edited a little later than the LWN publication and might have had some minor
errata fixes: <http://www.akkadia.org/drepper/cpumemory.pdf>

~~~
wmf
As much as I liked Stokes's articles in Ars, I think programmers should
probably skip _Inside the Machine_ and go straight to Hennessy and Patterson.
_Inside the Machine_ is loaded with analogies that I find annoyingly
distracting; I'd rather learn how a computer actually works than learn an
analogy about how it works.

------
klochner
Very useful information, but this is way overkill for what "every programmer
should know".

Every programmer should have a basic understanding of latencies.

This is slightly out of date but still gives an idea of relative latencies:
<http://norvig.com/21-days.html#answers>
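
For reference, the rough numbers from that list (dated, but the relative
magnitudes are the point):

    execute typical instruction          ~1 ns
    fetch from L1 cache                  ~0.5 ns
    branch misprediction                 ~5 ns
    fetch from L2 cache                  ~7 ns
    mutex lock/unlock                    ~25 ns
    fetch from main memory               ~100 ns
    send 2KB over 1 Gbps network         ~20,000 ns
    read 1MB sequentially from memory    ~250,000 ns
    disk seek                            ~8,000,000 ns
    read 1MB sequentially from disk      ~20,000,000 ns
    packet round trip US <-> Europe      ~150,000,000 ns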

~~~
prophetjohn
Exactly. This is a significant portion of a graduate-level computer
architecture course. Your average Rails/Node/etc. programmer doesn't _need_ to
know half of what's covered during the 3 weeks of an undergraduate
architecture course that are devoted to the memory system.

How many programmers in 2012 are ever even going to need to cache-optimize
their code? 25%? 10%? The content is fascinating, but the title is pretty
wrong.

~~~
ajross
I hear this so much, but it's just wrong. Here's an "interview-style" question
which is hopefully practical enough to show why:

Let's say you have a giant in-memory balanced binary tree (for your data
store, maybe) with mean depth of ... 27, say. And you're deployed on a machine
with 4 DRAM channels populated with (for the sake of argument) 7-7-7-23 DDR3
memory. Give me a ballpark estimate for how many reads per second you can
achieve from that tree.

My experience is that virtually no one can answer this question well, and most
programmers' intuition is wildly off (they think it's much faster than it is).
And even the ones who get it mostly right usually just use "100 cycles" as
their assumption of DRAM latency and get tripped up on the subtleties (like
the interaction with the number of channels vs. precharge overlap; or the
already-active row latency for sequential access which this question doesn't
cover).

Reading this paper will teach you how to answer that question from first
principles. Surely that's worth a few hours of your time. At the very least
it's worth more than arguing on the internet about how you _don't_ need to
know it.

(edit: A bonus followup question -- hopefully just as practical -- would be:
"So maybe binary trees aren't great. Given a node data size of N, what is the
most appropriate fanout to use in a B tree implementation?". This pulls
knowledge of caching architectures into the mix.)
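
For anyone who wants to sanity-check their intuition, here is a minimal
back-of-envelope sketch in C. Every constant in it (the ~70 ns cost of a
dependent random read on a 7-7-7-23 DDR3 part, the number of tree levels
assumed to stay hot in cache) is an assumption for illustration, not a
measurement:

    #include <stdio.h>

    int main(void) {
        /* Assumed: 7-7-7-23 DDR3 at ~1.9 ns per memory clock.
         * tRCD + CL + burst on a closed row is roughly
         * (7 + 7 + 4) * 1.9 ns, call it ~35 ns on the DIMM; with
         * controller and queueing overhead, a dependent random
         * read lands somewhere around ~70 ns end to end. */
        double miss_ns = 70.0;

        /* Assumed: the hot top levels of the tree stay cached and
         * the remaining ~20 levels are random DRAM reads. Each
         * read depends on the previous one (pointer chasing), so
         * the 4 channels cannot overlap them within one lookup. */
        int dram_levels = 27 - 7;
        double lookup_ns = dram_levels * miss_ns;

        printf("~%.1f us/lookup => ~%.0fk lookups/s per thread\n",
               lookup_ns / 1e3, 1e9 / lookup_ns / 1e3);
        return 0;
    }

That lands in the high hundreds of thousands of lookups per second per
thread, an order of magnitude or two below most people's first guess. The
channels only help across independent lookups, so aggregate throughput scales
with concurrency until they saturate.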

~~~
btilly
Sometimes the job of an interview question is to make it clear to the
interviewee that this is not, after all, somewhere they want to work.

That would be my reaction to those interview questions.

As for the follow-up question, my response would be to ask why you are
optimizing your B tree for a specific architecture instead of using a cache-
oblivious B tree. (Yes, I can think of several reasons why not to. But if the
interviewer is asking me BS questions, I want to know whether their knowledge
is similarly good.)

Of course in practice I'm generally happy to use BerkeleyDB - the B tree has
never been a performance bottleneck for me except when it involved disk
latency. And at that point a more fundamental reorganization of algorithms
becomes necessary.

~~~
cbsmith
Cache wasn't mentioned in the question... there might be no cache at all. ;-)

Cache oblivious B-trees aren't exactly oblivious to how DRAM behaves, and
sometimes their operation actually can be counter to what works best in DRAM
(though I agree that generally they tend to do better than data structures
that are "oblivious" in a very different sense).

But let's ignore that for a moment. Okay, you aren't using BerkeleyDB, MySQL,
or any of the other common B-trees out there, but instead you've got your
hands on a cache-oblivious B-tree. What is the approximate optimal fanout
and what is the approximate number of reads per second you can achieve with
the 7-7-7-23 DDR3 memory (or memory of your choice that is commonly found in
modern systems)?

The question _isn't_ really about knowing the specifics of a particular memory
infrastructure. It's about understanding the cost of going to memory, which is
increasingly as important as understanding disk latency (you know how "disk is
the new tape"? well, RAM is the new "disk").

Interestingly, a lot of databases these days are CPU bound, particularly when
working with big data, but even in other cases. They aren't _really_ CPU
bound; it just looks like that because of how accounting is done. When you
poke under the covers, you see the CPU pipelines are comparatively idle
because the CPUs are all tied up accessing memory.

Particularly if you are using AWS or other virtualization services, most web
applications these days rely on servicing a majority of their requests
entirely from memory. Being able to understand the performance characteristics
of memory at least within an order of magnitude becomes increasingly
important, because the average rate at which you can serve requests from
memory becomes a critical metric for the scaling of your services.

It's even more than that, though. As concurrency becomes increasingly
important, locks and atomics become central to a lot of work. The last few
iterations of Moore's law have changed the principles of how locks & CAS
impact performance. It _used_ to be that locks were bad, because you often
burned up CPU going down to the kernel, but more than that because suddenly
you are wasting CPU while you wait for something else to complete. CAS allowed
you to largely sidestep all that CPU wasting and you're a happy man. Then CAS
got integrated into locks so that locks only really sucked when you had
contention, but now... The really expensive cost of a lock these days is the
memory fence put around it. CAS is better because the fence is scoped to only
the object being swapped, but _even that_ can be horribly expensive if there
is contention on the object. Having a clear understanding of just how big that
cost is relative to burning some more CPU, wasting some RAM, or even changing
the constraints for your code, can be very important.
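
To make that concrete, here is a minimal sketch of the CAS pattern using C11
atomics (a toy counter, purely illustrative):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long counter;

    /* Lock-free increment: retry until no other core modified the
     * value between our load and our swap. */
    static void increment(void) {
        long old = atomic_load_explicit(&counter, memory_order_relaxed);
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;  /* on failure, `old` is refreshed with the current value */
    }

    int main(void) {
        increment();
        printf("%ld\n", atomic_load(&counter));
        return 0;
    }

The compare-exchange defaults to sequentially consistent ordering, and under
contention the cache line holding the counter bounces between cores; that
coherence traffic, not the instruction itself, is the cost described above.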

Trust me. This stuff really matters for any context where performance matters,
and it is only going to get more important in the coming years. Memory has
pretty much already trumped CPU in terms of its importance for most
applications, and the number of cases for which it is more important than disk
is growing at a rapid rate.

~~~
btilly
I believe you.

However I've spent most of my life working within scripting languages. And the
same is probably true of most programmers in the startup world.

If you're able to get away with using a scripting language, then your problems
had better not be performance-critical, because your Swiss Army knife of
choice really sucks as a saw...

~~~
cbsmith
First, the notion that using a scripting language means that performance isn't
an issue is a canard. It ought to mean your revenue isn't directly correlated
to performance, but that's about it. Indeed, if you are using a scripting
language, odds are it's _more_ important that you understand the trade-offs,
as your lack of low-level options means you need to make smart choices about
high-level options.

Second, _very_ few startups don't end up making decisions about performance
even in the early days. What? You're using a NoSQL store? Is that because you
just hate yourself or because you thought it'd be more efficient/scale better
than an RDBMS? Why aren't you just storing it all in memory? Wait, you can run
out of that? Just let it swap then! Actually, why not just store all the data
in a flat file? You don't even need to sort it, you can just scan through it
whenever you need to, because performance doesn't matter. Why are you using an
associative array there? Do you care about having efficient lookups based on a
key or something?

You see my point.

I get that some people are playing around in a segment of startup land where
programming isn't really that important, but presumably your long term plan
isn't for every startup you participate in to never go anywhere. What if you
are a success? Do you still want to be on the technical side when you've now
got a team of developers and a serious customer base? If not, that's fine and
likely a great career plan, but you aren't a programmer in the professional
sense (which is what the paper is referring to). If you are planning on
staying on that side, either learn this stuff or expect to be cast aside and
not grow with the organization, because you aren't ready for the bigger world
you'd now be playing in.

To be clear here, I'm not suggesting you'd have to pick up a systems
programming language. Heck, when I started my career two decades ago, when
CPUs were a lot slower and interpreters weren't as efficient as they are now,
there was already this notion that "oh, I'm working in an interpreted language
which provides a layer of abstraction so I don't need to know how computers
work". I made a good start of my career (and met a lot of people who had been
doing so for quite some time) essentially kicking those guys out of their jobs
because it turned out doing scripting with an understanding of the computing
consequences of your choices is applicable in a very broad set of situations,
but scripting without that understanding is barely more useful than knowing
how to use Excel.

High level languages actually can kick quite a bit of butt in performance
critical situations, because they let you focus on the bigger picture of how
you are executing, where the bigger performance wins can be had (I've seen
"scripting languages" winning performance bake-offs for exactly this reason),
but you have to know what the heck you are doing. People often make the
mistake of thinking this won't matter. They tend to have very short careers.

~~~
btilly
Within the limitations of a scripting language, of course you try to get the
best performance that you can.

However in general within a scripting language you are not trying to optimize
things at the CPU cache level. You're worried about overall efficiency of your
algorithms, avoiding disk, identifying performance bottlenecks, etc. But the
pervasive use of hashes everywhere and memory-hungry data structures make
careful use of CPU caches pretty much a lost cause.

Furthermore the language does so much behind your back that an analysis of
performance from first principles is very unlikely to be right. Instead you
need to benchmark.

------
grundprinzip
We use this full document very often when teaching graduate-level courses, and
it has always helped the students understand the underlying concepts of data
access. Even though for most people a thorough understanding of DRAM refresh
latencies is not important, it is still a very, very important read for every
programmer.

Why?

The answer is easy: almost everything in modern computers is about locality,
spatial and temporal. As soon as the complexity of the programs you write is
one level above "Hello World" (or, in Rails-speak, a simple controller
method), this becomes important.

It's easy to translate the concept of aligning data in DRAM (because it's
faster) into Rails-ish behavior. Assume that you read lots of data from your
database and process it item by item. Inside your loop you perform another
fetch from the database, again and again. If you instead joined the data (that
is, aligned it), you would need fewer requests to the database.

Any system architecture that involves handling data will at some point come to
the situation where the programmer has to think about ordering instructions,
database queries, or even attributes inside a C struct (see the discussion
about why short Ruby strings are faster than long...). If you keep nothing
else in mind from this document but remember that sequential access and
exploiting locality will increase performance, then the document has made its
point.
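
To make that concrete in C terms, here is a minimal sketch (wrap each loop in
your timer of choice): both loops do identical arithmetic, but the first walks
memory sequentially, touching each 64-byte cache line once, while the second
strides a whole row per access and can miss on every element.

    #include <stdio.h>

    #define N 2048

    static double a[N][N];  /* 32 MB, zero-initialized */

    int main(void) {
        double sum = 0.0;

        /* Sequential (row-major) traversal: cache- and
         * prefetcher-friendly. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Strided (column-major) traversal: same work, but each
         * access jumps N * 8 bytes, defeating locality. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }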

Take the hours, read the document, probably forget most of it, but take it as
inspiration rather than an optimization guide.

~~~
Roboprog
I used to think about locality a fair amount back when I programmed in C.

In Java, I have no idea whether the fields in a
copybook^H^H^H^H^H^H^H^H bean are going to be anywhere near contiguous.
Packing and unpacking stuff in a byte array / string gets tedious.

It's a shame low level integration isn't somehow easier in Java, like it is in
C#.

------
intractable
Ahh, the "What every programmer should know about X" articles.

So, Ulrich Drepper thinks everyone should know about the intricacies of DRAM
refreshing. Zed Shaw thinks we should all know statistics. Some other schmoe
thinks we should know about SEO. Et cetera.

Posts like these should be titled "I worked hard and feel good about myself
for knowing X, therefore everyone should know it."

It is blisteringly stupid. Every programmer needs to know WHAT THEY NEED TO
KNOW.

Nobody can know everything. Why don't we all just learn what we need, build
the things which interest us, and stop telling others what THEY need?

Having said that... interesting stuff. Just needs a title change.

~~~
oakenshield
It's on LWN. 90% of programmers who go there really do need to know memory in
the detail Drepper talks about. I see your point though - I fully expected the
title to be linkbait to a hastily written article.

This article, however, is the real deal. Drepper's site
(<http://www.akkadia.org/drepper/>) won't win any design competitions but is a
goldmine of systems programming wisdom.

------
vilya
I'm surprised to see so many people here getting hung up on the _title_ of
this paper, of all things, when there's so much good information in it.

The title is just the author's opinion. Deal with it. And then read the paper
anyway, because even if you don't use it directly it will help make you a
better programmer.

------
ajross
These are old, but great. Drepper's history with glibc is a little checkered,
but his whitepapers (there's an equally great one on the NPTL work from back
when that was being done in glibc) are gold.

Understanding the DRAM cycle is critical to performance tuning in modern code,
and my experience is that virtually no application developers know how it
works.

------
tomerv
I wonder if PCs will ever have non-linear memory.

To explain what I mean, suppose you have a big 2-dimensional array in memory.
As it works now, if you read a cell from memory, the area in memory around it
is brought into the cache. Because of the way that arrays are represented in
memory, what you get in the cache is just cells from the same row (or column).
This means that if you need to access neighboring cells above or below the
current cell you're not making any use of the cache. The way I see it, it
might be useful if the system could somehow instruct the memory to bring a
different-shaped block into the cache. Maybe there will be a whole part of
memory intended for 2-dimensional arrays, and where blocks of memory will be
allocated as squares on the plane (instead of segments on a line). Of course,
something like this would require support from the programming language, which
would need to put 2-dimensional arrays in that special part of the memory
(and allocation will be much harder, now that you need to allocate
2-dimensional chunks).

~~~
ajross
That kind of caching architecture exists for graphics hardware already. GPUs
have modes where the natural breakdown of memory is 2D (basically by
interleaving the X and Y bits), and indeed that improves cache locality for 2D
accesses. But not by a whole lot.
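
(For the curious: that X/Y bit interleaving is usually called Morton or
Z-order. A minimal sketch of the encoding in C, for 16-bit coordinates:)

    #include <stdint.h>
    #include <stdio.h>

    /* Spread the 16 bits of v out to the even bit positions of a
     * 32-bit word. */
    static uint32_t spread(uint32_t v) {
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    /* Interleave x and y: points close in 2D stay close in the
     * 1D address order, which is what helps cache locality. */
    static uint32_t morton(uint32_t x, uint32_t y) {
        return spread(x) | (spread(y) << 1);
    }

    int main(void) {
        printf("%u\n", morton(3, 5));  /* prints 39 (0b100111) */
        return 0;
    }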

The reason caches come in "lines" that are bigger than the word size of the
processor is not an optimization (i.e. they're not deliberately bringing in
nearby memory in the hope that it will be useful). The reason is that the
logic associated with maintaining the cache (index and tag bits) is expensive,
and needs to be repeated for each "thing" you have in the cache. So by making
the "things" bigger (e.g. the 512-bit cache line on modern CPUs) you reduce
the overhead of the bookkeeping.
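
As a rough worked example of that bookkeeping cost (illustrative numbers,
assuming 48-bit physical addresses): a 32 KB cache with 64-byte lines has 512
entries, and at roughly 38 bits of tag and state per entry that is about
2.4 KB of overhead. Shrink the lines to one 8-byte word and you get 4096
entries, so the same per-entry cost balloons to about 19 KB, more than half
the size of the data array itself, plus eight times as many tag comparators.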

~~~
Someone
_"The reason caches come in "lines" that are bigger than the words size of the
processor is not an optimization (i.e. they're not deliberately bringing in
nearby memory in the hope that it will be useful)."_

I do not think that is true. There is considerable chance that that extra
memory will be useful. The simplest example to think of are cache lines that
contain program code. The other canonical example is the "for item in array do
item = f(item)" loop.

~~~
ajross
No, there's hardware to do that too (speculative reads), but it's even more
complicated. Really the driver of cache line size is simple efficiency.
Software has to jump through lots of hoops in practice to try to align
accesses to cache lines, and that could be avoided if memory could be
efficiently cached in word-sized chunks.

~~~
Someone
I still doubt that. IMO, cache lines are larger than the largest item a CPU
can read because the probability that the extra data will be needed soon is
high enough to offset the extra work needed to read that larger cache line and
the (few) transistors needed to increase cache line size.

In some sense, large cache lines are just cheap ways to implement speculative
reads.

See for example <http://dl.acm.org/citation.cfm?id=255272> (paywalled, but
the abstract is clear enough):

"A significant observation is that, although increasing line sizes can result
in a higher hit ratio, it also considerably increases traffic to main memory,
thereby degrading the performance."

------
jiggy2011
If every programmer actually had to know the contents of every "what every
programmer should know about X" article before programming, then we wouldn't
have any programs.

------
rtkwe
<http://www.scribd.com/doc/92111472>

I'll just leave this here: a PDF version for anyone who doesn't want to mess
with the webpage reformatting.

------
olalonde
This was posted almost 2 years ago:
<http://news.ycombinator.com/item?id=1511990>

------
balloot
It seems that every day on HN there is some post offering something "every
programmer needs to know", yet the post is actually some very in-depth dive
into the minutiae of the author's pet subject.

The other day there was some post on front-end/client-side programming with a
crazy amount of things everyone needs to know. Now apparently we also have to
know the characteristics of the charging/discharging of capacitors in memory.

All in all, this is WAY too in-depth to have the title it has. The only way
you'd have to know all this stuff as a software engineer is if you were
working on the lowest level of something like an OS kernel or a video game
engine.

------
goggles99
Today, this is what every "low-level programmer" should know. If you are using
a high-level language with built-in memory management and garbage collection,
a condensed subset of this will suffice, as only a fraction of the content
will be helpful to you.

