What Every Programmer Should Know About Memory (2007) [pdf] (akkadia.org)
214 points by jxub 15 days ago | 97 comments

Is this title and content supposed to be ironic?

I quickly perused the article and I think this link should be renamed "What 99.9% of programmers don't need to know about memory."

I've managed to go from Associate to Principal without knowing 99% of what's covered in this document, and I'm struggling to understand why the average Java, C#, Python, Rust, <insert language here> programmer would need to know about transistor configurations or voltages, pin configurations, etc. Let alone 114 pages of low level hardware diagrams and jargon!

This document is for someone working on low level drivers for memory, or working on the hardware side. For any normal software engineer, this information is not helpful for doing your job.

I would argue that the overclocker needs to know more about these details than most programmers, yes. (Overclockers actually tweak these values to maximize the performance of their computer).

But any high-performance programmer needs to understand the RAS / CAS / PRE cycle, if only to understand WHY the "streaming" of data is efficient, while random-access is very inefficient.

If you are accessing RAM randomly, you'd better be sure it's within L3 cache (or nearer). I've done some experiments, and "streaming" data from beginning to end can be 2x to 3x faster than random access on modern DDR4 RAM.

Understanding the RAS / CAS / PRE cycle helps me understand why streaming data to RAM is faster. And understanding that cells are simply capacitors helps me understand why the RAS / CAS / PRE cycle is necessary in DRAM.

> if only to understand WHY the "streaming" of data is efficient, while random-access is very inefficient.

I think what they're saying is that 99.9% of programmers don't need to know why, they just need to know that streaming access is 2-3x faster than random even without any memory stalls.

You don’t need to know any of that to know that streaming access is more efficient than random access. You just need to know that caches cache localized blocks. Actually, you don’t even need to know that. You could even just be told that streaming access is faster than random access.

This is how cargo cult programming begins.

It's called working on the relevant level of abstraction.

I am all for knowing things just to know things...but why does a programmer need to know WHY the streaming of data is efficient?

Because fundamentally, the DRAM is a major component of the computer, just as CPU cores are a major component of the computer.

Now, I'd personally explain things in a far more simple manner than what was described in the PDF. Here are the facts that programmers need to know:

1. DRAM stores data in very tiny capacitors. These tiny capacitors have two properties: they run out of electricity in just 64ms. And second, they run out of electricity after a SINGLE read operation.

2. DRAM has a temporary location called "sense amplifiers" where data is stored during a refresh or a read. These sense amplifiers can hold data permanently.

3. This "temporary read" is called Row-open (or RAS). Reading from an already open row is called a Column-read (CAS). Sending the data back to DRAM proper is called Precharge (PRE). Remember, the sense amplifiers must be clear before they can read from a new row. (The old data in DRAM was destroyed when you read it with the RAS operation)

4. I guess there's a periodic refresh you should know about: instead of trying to fix all RAM every 64ms, you're supposed to do it in small chunks at a time. Every dozen microseconds, RAM will self-read / self-write to refresh another row. Don't be surprised if your memory-reads randomly stall out an extra few hundred nanoseconds because of this refresh.

The end. Not so hard, now is it?


DRAM is faster when you stream, because you open a row once, fill out all the data in a row, and then send the row back to DRAM. In effect, you only have to do a bunch of "column" writes to sense amplifiers, as opposed to opening-and-closing a bunch of different rows.
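For anyone who wants to measure this themselves, here is a minimal sketch in C. The function names, the xorshift shuffle, and the idea of comparing the same sum in two visiting orders are my own assumptions for illustration, not something from the PDF. Both functions do identical arithmetic; only the order in which memory is touched differs.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the array front to back: consecutive addresses, so every fetched
   cache line (and every open DRAM row) is used in full. */
static uint64_t sum_sequential(const uint32_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Sum the same elements, visiting them in the order given by `order`,
   which must be a permutation of 0..n-1. */
static uint64_t sum_in_order(const uint32_t *data, const size_t *order, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[order[i]];
    return total;
}

/* Fill `order` with a pseudo-random permutation of 0..n-1 using a
   Fisher-Yates shuffle driven by a small xorshift generator.
   The modulo introduces a slight bias, which is fine for a benchmark. */
static void shuffle_order(size_t *order, size_t n, uint64_t seed)
{
    for (size_t i = 0; i < n; i++)
        order[i] = i;

    uint64_t state = seed ? seed : 1; /* xorshift must not start at 0 */
    for (size_t i = n; i > 1; i--) {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        size_t j = (size_t)(state % i);
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}
```

To see the effect, make the array much larger than your last-level cache, time `sum_sequential` against `sum_in_order` with a shuffled `order` (for example with `clock_gettime`), and compare. The ratio depends heavily on the machine, so treat any number you get as a local measurement.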


So yeah, programmers should know it because it's really not that hard to learn :-) And if you start measuring your program at the nanosecond level, you'll actually see these effects and start to demand explanations.


EDIT: Hmmm... the more I think of it, the less it's something "programmers" need to know and the more it's something "SysAdmins / DevOps" need to know. An advanced Sys Admin can use these profiler tools to figure out whether they need that 6x Memory Channel computer or the 8x Memory Channel computer on the next purchase.

Is your code memory-bound? Or is it CPU bound? Should you buy more cores? Should you buy more LRDIMMs for higher amounts of RAM? Or is your program latency-bound and actually benefits from the lower latency of RDIMMs or even UDIMMs ?

The programmers kinda don't make those decisions.

I'd say this explanation merits the title "What Every Programmer Should Know", more so than the original twelve or so pages.

So, RAM is the new tape, cache is the new RAM?

I write user-level C/C++ programs and think that this book is 100% relevant. Yes the details about the transistors can be skipped for first-time readers. But eventually these explain what programmers should expect from the machine, and why.

You should get acquainted with 'What Every Computer Scientist Should Know About Floating-Point Arithmetic' after which similar articles are named.

I'd bet you would suggest to rename it too.

Computer Science is a different discipline to Software Engineering.

I would argue that sure, if you want to be a PhD, academic, researcher etc, in Computer Science, then reading said article might be useful.

For engineers working at companies whose goal is to get product to market faster, coding ruby on rails, javascript, making web apps, websites, it's far more useful knowing your web frameworks well and being able to iterate fast, than knowing the intimate mathematical details of floating point arithmetic.

Computer Science != engineering. It's like saying knowing physics is the same thing as knowing how to build a building.

The referenced work talks about floating point errors that both computer scientists and field engineers need to avoid. Many folks in both camps screw up on floating point code without such guidance. So, at least that part is potentially helpful.

You just proved their point about how you'd reply, though.

Proved their point? Their point didn't make much sense.

What actual engineers NEED to know about floating point numbers could be put on a single A4 page. It doesn't have to be 50 pages, where many of those pages are full of equations that 99% of engineers don't understand.

My point is, because you seem to have trouble grasping this. Most real-world programmers, working at startups or bigger companies, making their WPF applications or Spring applications, or react/jquery/angular apps, or Swift IOS apps, etc, do not NEED to know 99% of this document. For most real-life cases all they need to know is "Use the decimal type if you're dealing with money" or something along those lines.

Why do I know I'm correct? Because I work in an organization with 800-1000 engineers, and I know a fair few of them myself, and I bet maybe 1 person in the org has read this doc fully (and even this is a stretch). But the company still makes billions in revenue every year like clockwork and the world keeps on spinning.

Hence, this document is for theorists and academics, not for the average engineer making enterprise business applications. If it is for an engineer, it's for someone making extremely niche mathematical software or something equally arcane.

> ...you seem to have trouble grasping this.

Please don't insult people like that.

If someone doesn't follow your reasoning, take responsibility yourself and find another way to explain the point so others may understand it more easily.

I know that beginner Ruby programmers are constantly surprised by floating point arithmetic results due to not understanding how they work. I see it all the time.
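A minimal sketch of the kind of surprise meant here, written in C rather than Ruby. The function names are made up for illustration, and holding money as whole cents is a stand-in for the "decimal type" advice, since C has no built-in decimal type.

```c
#include <stdint.h>

/* Add `step` to a running total `count` times using binary floating point.
   Values such as 0.1 have no exact binary representation, so the rounding
   error of each addition can accumulate. */
static double add_repeatedly(double step, int count)
{
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += step;
    return total;
}

/* The same running total with money held as a whole number of cents.
   Integer addition is exact as long as the total does not overflow. */
static int64_t add_cents_repeatedly(int64_t cents, int count)
{
    int64_t total = 0;
    for (int i = 0; i < count; i++)
        total += cents;
    return total;
}
```

With IEEE 754 doubles, `add_repeatedly(0.1, 10) == 1.0` is false, while `add_cents_repeatedly(10, 10) == 100` is true. That one comparison is usually the first thing that trips people up.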

> What actual engineers NEED to know about floating point numbers could be put on a single A4 page.

I don’t think you could explain the issues sufficiently on one page. I think you'd be left with more questions and confusion than answers.

"Proved their point? Their point didn't make much sense."

"My point is, because you seem to have trouble grasping this, ..."

Ha-ha, the irony.

Mazel tov. There are also people who went to "Principal" without being able to program at all.

Indeed, there are a limited number of hard-hitting topics that are definitely must-haves:

* Rough latency timings

* Caching

* Prefetching

* Sequential vs. random access

* N-dimensional layouts (row/column major and arbitrarily strided)

* Design of cache-oblivious algorithms

* SIMD-able access patterns

* False sharing

* Instruction cache & code size

* Branch prediction and speculative execution

I'd be curious to hear what else folks would put on (or remove from) this list.

Again, you say these are "must haves." But in the real world of software, none of these would make any difference to most of the software being built today, much of which is being built in higher level languages for web applications/web sites/microservices etc.

Let's say someone is building a micro-service in C#.Net. Why would any of this stuff matter to them? The company cares about features and moving forward quickly.

It's rather like forcing all your developers to be able to code every type of different sorting algorithm when they are only ever going to call "mylist.Sort()" in reality.

This seems like another of those occasions when HN misjudges what the vast reality of practical day to day software creation looks like. Most engineers are not writing low-level code, they are not working on hardware directly, and they don't need to know exactly how RAM works at the transistor level.

I'm not saying it's not interesting, but it's not going to change how I write my for-loops and if statements or make calls to BigTable, etc.

> I'm not saying it's not interesting, but it's not going to change how I write my for-loops and if statements or make calls to BigTable, etc.

You might be surprised at how much changing your for loops can be when accessing matrix data.

    for(int i=0; i<10000; i++){
        for(int j=0; j<10000; j++){
            blah[j][i] = whatever(); // Column oriented is very slow.
        }
    }

    for(int j=0; j<10000; j++){
        for(int i=0; i<10000; i++){ // Very fast: We move row-wise now. With luck, this is SIMD-vectorized by your compiler.
            blah[j][i] = whatever();
        }
    }

Test that out, and you'll see a major performance improvement.


If someone is writing high performance code, the #1 goal is to be able to read your profiler's output. If you can't read what the profiler says (cache hits, TLB hits, memory stalls, etc. etc.) then you can't make sense of the data.

The PDF goes deeper than I'd personally go, but there's some concepts here that are absolutely necessary if you actually want to understand what a decent profiler gives you these days.

I have to admit, I was skeptical but my compiler did pick SIMD for the second and not the first and it did make a huge difference.

  % cat t.c
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <stdio.h>

  double get_time()
  {
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec*1e-6;
  }

  int blah[SIZE][SIZE];
  int whatever() { static int i=0; return ++i; }

  int main()
  {
    double t0 = get_time();

    for(int i=0; i<SIZE; i++){
        for(int j=0; j<SIZE; j++){
            blah[j][i] = whatever(); // Column oriented is very slow.
        }
    }

    double t1 = get_time();

    for(int j=0; j<SIZE; j++){
        for(int i=0; i<SIZE; i++){ // Very fast: We move row-wise now. With luck, this is SIMD-vectorized by your compiler.
            blah[j][i] = whatever();
        }
    }

    double t2 = get_time();

    printf("SIZE=%5d dt1=%4.3e dt2=%4.3e\n",SIZE,t1-t0,t2-t1);

    return 0;
  }

  % for (( i=1; i<100000 ; i*=10 )) do gcc -DSIZE=$i -O t.c; ./a.out; done 
  SIZE=    1 dt1=4.179e-04 dt2=0.000e+00
  SIZE=   10 dt1=5.190e-04 dt2=0.000e+00
  SIZE=  100 dt1=5.062e-04 dt2=9.537e-07
  SIZE= 1000 dt1=4.014e-03 dt2=1.490e-04
  SIZE=10000 dt1=1.347e+00 dt2=4.349e-02
2.8G Core i7 1600MHz DDR3

n.b. there would be a difference even without vectorization

Now you could try to time it with -fno-tree-vectorize or whatever gcc uses nowadays. You might be surprised again :)

Yup yup. SIMD is a CPU-level optimization, but the problem (at size 10,000 x 10,000) is DDR4 memory-limited.

SIMD only really makes a difference at the L1 or L2 cache levels. It's a fancy micro-optimization that compilers do and can improve code speed in those cases...

But the "big" change, going from column-wise traversal into row-wise traversal, is the huge memory optimization that programmers should know about. It just so happens that SIMD-optimizations are also easier for compilers to figure out on row-wise traversal, so you get SIMD-optimization "for free" in many cases.

Can you elaborate why the second is faster in the general case? Is this an issue of row major vs column major orientation? Also does your example also hold in the absence of SIMD? Thanks.

There's at least 2 reasons (+1 non-reason) why it is slow.

1. L1 cache lines are 64-bytes long. By fetching column-wise, you are wasting the bandwidth between L1 and main-memory. L1 cache will always fetch 64-bytes. By moving "with" the cache, you allow the L1 --> Main Memory data-transfers to be far more efficient.

2. Virtual Memory is translated by the TLB before it returns the actual value. Moving within a 4kB page is more efficient than moving across pages.

The non-reason:

* Hardware prefetcher probably works, even on column-oriented data.

All of these reasons hold even if the SIMD-optimizer fails. If the SIMD-optimizer is actually working, you'll more efficiently load/store to L1 cache. But this is likely a memory-bound problem and optimizing the core isn't as important.
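To put a rough number on reason 1, here is a small sketch that walks a row-major matrix in either order and counts how often consecutive accesses land on a different 64-byte line. The function name and the counting model are my own simplification: it assumes the matrix starts on a line boundary and ignores that a real cache keeps recently used lines around, so it shows the access pattern rather than predicting actual miss counts.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* line size assumed in the explanation above */

/* Walk a rows x cols matrix of `elem_size`-byte elements stored row-major
   and count how often an access falls on a different cache line than the
   access just before it. Pass column_wise = 0 to move along rows and
   column_wise = 1 to move down columns. */
static size_t count_line_switches(size_t rows, size_t cols,
                                  size_t elem_size, int column_wise)
{
    size_t switches = 0;
    size_t prev_line = (size_t)-1; /* sentinel: no line touched yet */
    size_t outer = column_wise ? cols : rows;
    size_t inner = column_wise ? rows : cols;

    for (size_t a = 0; a < outer; a++) {
        for (size_t b = 0; b < inner; b++) {
            size_t r = column_wise ? b : a;
            size_t c = column_wise ? a : b;
            size_t byte_offset = (r * cols + c) * elem_size;
            size_t line = byte_offset / CACHE_LINE_BYTES;
            if (line != prev_line) {
                switches++;
                prev_line = line;
            }
        }
    }
    return switches;
}
```

For a 1024 x 1024 matrix of 4-byte ints this returns 65,536 for the row-wise walk and 1,048,576 for the column-wise walk. Row-wise uses all 16 ints of each line before moving on; column-wise moves to a new line on every single access.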

Makes perfect sense, thanks for the concise explanations. Cheers.

If I saw that code at my company, I'd ask why someone building a rest api for web-services had to iterate through a 10000*10000 2D array while calling a method of every single cell. This isn't a use case I've seen in my 10 year career thus far.

Also, you have to understand that saving 1 or 2 milliseconds isn't important when compared to getting software built faster. For the typical use case, optimizing a for-loop like this for in-memory data isn't going to change the total performance of your program by even 0.5%. This is what so many people just don't understand. For modern web based applications, 99.9% of the time spent is doing IO, either hard-drive, data storage, or across the network. Optimizing those calls is FAR more important, than wasting time trying to save 1 millisecond in a for-loop which is iterating over 100 in-memory objects.

This is where the understanding that the practical reality of building software for a real business is very different to theoretical examples drummed up by computer scientists on discussion boards.

I see so many engineers trying to make their in-memory code faster by parallelizing it, saving (if they're lucky) 1 or 2 milliseconds, but making their code far less maintainable and potentially introducing bugs, while the users wouldn't notice any practical difference in the app. Meanwhile, the app really does feel slow, because some numb-nuts made a call to some api and blocked the ui thread.

Where I work we have to process a few billion records per day. Knowing how memory works and writing the code in a way that uses memory bandwidth efficiently allows us to process all those records on a smallish AWS instance instead of having to use Hadoop on a much more expensive cluster of instances.

Hadoop is for "big data." Big data is commonly defined as minimum of petabytes of data. That is, thousands of terabytes. Our ingest of new events to our cluster is upwards of 25B per day and we keep all that data in perpetuity.

Our hadoop cluster is relatively small at 5PB right now, but I would be impressed indeed if you were processing PB of data on a smallish AWS instance.

If you were considering using Hadoop for work that can be processed on a "smallish instance" then I'd suggest you maybe don't really understand big-data or the normal workloads for which Hadoop is intended. For example, we run ML models on our Hadoop cluster which take hours to run distributed over around 100 nodes. Let me know when this can be done on a "smallish instance."

I don't work in big data. But here's what I think of your field anyway :-)

In general, you optimize software by starting from the slowest things first, and then working your way to the faster things when you run out of things to optimize. It seems like your problems are I/O bound, and therefore the bulk of optimizations can be made by simply optimizing your I/O (either through async calls, or better understanding of what the frameworks are actually doing, etc. etc.).

And that makes sense for sure.

The thing is: the next level of optimization is not CPU-optimization, but instead memory optimization. In fact, all CPU-based optimization starts at the Main-memory level. Why?

Because memory is slower than virtually everything inside of the main CPU Core.

Which means, memory-level optimization is a far more important skill than any other CPU-based optimization technique. Modern CPU Cores operate at 4GHz easily, but RAM only responds every 100ns on server systems (that's 400 CPU-cycles of RAM Latency!)


In effect, if there's one low-level optimization any higher-level programmer should know about, it is RAM optimization. Sure, there are CPU-optimizations (SIMD registers, L1 vs L2 cache, and more), but RAM access itself is hundreds of times slower than CPU-speeds and needs to be optimized before lower-level optimizations are sought.

Once RAM accesses are fully optimized, then you can finally reach for the CPU-core optimizations. Instruction-level parallelism, or SIMD Registers, or what-not.

RAM is the next slowest part of your system after I/O. As such, optimizing RAM access is the most logical next step forward when your programs are running poorly.

> If I saw that code at my company, I'd ask why someone building a rest api for web-services had to iterate through a 10000*10000 2D array while calling a method of every single cell

Why not? I work with the scripting languages only and regularly hit this scenario so I won't say this is completely useless.

When you analyze a few hundred 50GB files for specific patterns, you have to go line by line. In those cases, shaving off milliseconds and optimizing how the data is accessed becomes valuable.

I've seen the code written by a Python web-programmer that for a given binary matrix of teams and tournaments produced top 10 rivals for each team where rivals are defined as teams that have picked the same tournaments to participate as your team.

There are tens of thousands teams and thousands of tournaments and his code was taking two days to precalculate the results for every team.

Rewritten with even a minimal understanding of how slow memory is, the new code takes less than a minute. There was no point in optimizing it further; the algorithmic changes were enough.

Sure, you don't really need to know any of this stuff unless you're actually needing to optimize code beyond the lowest hanging fruit.

My list presupposes you've hit a wall and need the best performance you can get. That's not always the case, but I certainly wouldn't say that these optimizations don't make a difference in the "real world of software."

False sharing alone can be the difference between a 16x parallelization speedup and a 1000x+ slowdown over the naive serial algorithm. 16x can be the difference between actionable results tomorrow or three weeks from today... and if you didn't know about it, the 1000x+ slowdown would be otherwise inscrutable.
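For readers who haven't run into it, a minimal sketch of the layout side of false sharing in C11. The struct names and the 64-byte line size are assumptions for illustration, and the threads that would hammer on the counters are left out. The point is only where the two counters end up in memory.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64 /* assumed line size; check your CPU */

/* Two per-thread counters packed next to each other. They share one cache
   line, so two cores writing to them keep stealing the line from each
   other even though they never touch the same variable. */
struct counters_shared {
    long a;
    long b;
};

/* One counter aligned to, and therefore padded out to, a full cache line. */
struct padded_counter {
    alignas(CACHE_LINE_BYTES) long value;
};

/* The same two counters, each on its own cache line. */
struct counters_padded {
    struct padded_counter a;
    struct padded_counter b;
};

/* True if two byte offsets inside a line-aligned object share a line. */
static int same_cache_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE_BYTES == off2 / CACHE_LINE_BYTES;
}
```

`same_cache_line(offsetof(struct counters_shared, a), offsetof(struct counters_shared, b))` is true, while the same check on `struct counters_padded` is false. The padded version spends 128 bytes on two longs, which is the usual trade for keeping writers out of each other's way.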

Right, the question then becomes "What percentage of programmers are going to need to optimize code beyond the lowest hanging fruit"

I don't know exactly what that number is going to be, but I do know it is going to be a lot less than 100%.

"most of the software being built today"

"web applications/web sites/microservices"

The world is bigger than you think it is. There could be more code in making a single AAA videogame than in the entire Amazon infrastructure.

Having a certain amount of Mechanical Sympathy (https://mechanical-sympathy.blogspot.com/2011/07/why-mechani... ) is important for having a good intuitive feel for the appropriate way to implement something.

For example, an O(n^2) algorithm can often beat an O(n) one when n is "small". How do you get that "experienced" feel for which to choose when you're designing your program?

This is even important for people doing REST APIs, Web Services and Microservices. Milliseconds matter when the load gets beyond a certain point and your system ends up full of stragglers. How do you get that "experienced" feel for where that point might be and how far you can push a particular architecture without investing in a major rework?

The effect of tail latency on most web servers might surprise you:



TL;DR: most of your visitors will experience the bottom 1% of your performance curve on every page load.
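The arithmetic behind a claim like this can be sketched in a few lines of C. The function name is made up, and two things are assumed: that a page load issues many requests, and that each request is slow independently of the others. This is not taken from the missing link, just the standard "at least one" probability.

```c
/* Probability that at least one of `requests` independent requests is slow,
   when each individual request is slow with probability `p_slow`.
   Computed as 1 - (1 - p_slow)^requests with a plain loop. */
static double prob_any_slow(double p_slow, int requests)
{
    double all_fast = 1.0;
    for (int i = 0; i < requests; i++)
        all_fast *= 1.0 - p_slow;
    return 1.0 - all_fast;
}
```

With `p_slow = 0.01` (the slowest 1% of requests) and a hypothetical 100 requests per page load, `prob_any_slow` comes out at roughly 0.63, so under those assumptions about two out of three page loads would wait on at least one worst-case request.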

"Why would any of this stuff matter to them?"

Back when I started, both the BASIC and Common Lisp camps had heuristics on helping the compiler generate more efficient code. The reasons are similar to today: increase throughput for scaling or customer experience while simultaneously reducing what you spend on hardware and/or electricity. Lean, easy-to-manage setups can also sometimes let you afford extra personnel to build stuff faster.

There's definitely a cut-off where micro-optimizations wouldn't be necessary. A lot of efficiency gains are simple, though. Just gotta consistently use what you learn.

Good list.

I'd add "Virtual Memory" to that list. In particular, the TLB cache, memory pages (4kB, 2MB "Large Pages", 1GB "Huge Pages").

Although x86 specific, I'd also add x86-64 has 48-bit physical pointers: the top 16-bits are basically ignored by the current virtual memory system. I dunno if the whole Page Directory / Directory Tables / etc. etc. needs to be fully explained, but programmers should have an overall good idea what they are.

There's lots of things to do with Virtual Memory. And anyone who actually reads profiler data needs to understand what the heck that TLB Cache Hits performance counter means.

> * False sharing

More specifically the MESI Protocol (although that's an abstraction), and cache lines should be taught. False Sharing comes as an understanding after you understand those other two concepts.

A CPU Core holds a cache line in Exclusive state so that it can write to it. A 2nd CPU Core attempts to gain access, but it cannot until the 1st core releases control (by writing data back to memory and setting the line to the Invalid state).

The knowledge of the 64-byte cache line is more general than just false sharing: it helps understand why alignment can be an issue (a load/store across a 64-byte cache line would require 2-reads by the memory controller), etc. etc.

I don't think you need to explain the intricacies of the MESI protocol — just explaining the fact that caches need to be consistent is quite sufficient. Perhaps throw in why they must be consistent. It then becomes clear that the cores need to communicate (somehow) to maintain this consistency if they're touching data within the same cache line.

MESI isn't really that complicated. Cache-lines are either Exclusive owned, Shared, Invalid, or Modified. CPU Cores communicate to each other which lines are owned or unowned, and that's how the caches stay coherent. If a CPU Core wants to change a cache-line owned by another core, they have to wait until the line is closed (set to "Invalid" state) by the other core.
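As a rough illustration, the textbook version fits in one small state-transition function. This is a simplified sketch of my own: the names are made up, bus messages and write-backs are not modeled, and real CPUs use more elaborate protocols.

```c
/* State of one cache line in one core's cache. */
typedef enum { MESI_MODIFIED, MESI_EXCLUSIVE, MESI_SHARED, MESI_INVALID } mesi_state;

/* What just happened to the line: this core read or wrote it,
   or another core's read or write was observed. */
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } mesi_event;

/* Next state of this core's copy of the line.
   `others_have_copy` only matters when an Invalid line is read locally. */
static mesi_state mesi_next(mesi_state s, mesi_event e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == MESI_INVALID)
            return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
        return s;                 /* M, E and S can all satisfy a read */
    case LOCAL_WRITE:
        return MESI_MODIFIED;     /* other copies get invalidated first */
    case REMOTE_READ:
        if (s == MESI_INVALID)
            return MESI_INVALID;
        return MESI_SHARED;       /* a Modified line is written back first */
    case REMOTE_WRITE:
        return MESI_INVALID;      /* the other core now owns the line */
    }
    return s;
}
```

The `REMOTE_WRITE` case is the one that matters for false sharing: every write by another core knocks this core's copy back to Invalid, so the next local access has to fetch the line again.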

I think it's easier to explain cache-coherence through MESI, rather than to abstractly just say "Caches are coherent". At least personally, I didn't understand cache coherence until I sat down and really tried to understand MESI.

I guess other people learn differently than I do, but I always view cache-coherence through the MESI lens.

>"Although x86 specific, I'd also add x86-64 has 48-bit physical pointers: the top 16-bits are basically ignored by the current virtual memory system."

Could you elaborate on this? Which pointers exactly?

All pointers in user-space are 48-bits. All 48-bit pointers are translated by the page-directory virtual memory system into a real physical location.

There's an extension to use 55-bits or 56-bits... I forget exactly. But I don't think it's actually been implemented on any CPU yet.

EDIT: It was Intel's 57-bit memory: https://en.wikipedia.org/wiki/Intel_5-level_paging . Yeah, I knew it was a weird number. But I guessed wrong earlier.

Oh I see what you are saying but that's not really just a pointer or a userspace thing, that 48 bit limit is simply imposed by the x86-64 CPU vendors, no?

With 48 bits you can still address 256TB of memory. I'm guessing that from a practical and financial point of view it probably made little sense for vendors to build a CPU that enabled addressing the full 64 bits. At least for now.

how is MESI "an abstraction"? why is it the only model proffered instead of MESIF/MOESI?

and "the top 16-bits are basically ignored" is a funny way to spell "general protection exception on linear memory reference in non-canonical space" but sure, guess we're just handwaving here

> how is MESI "an abstraction"? why is it the only model proffered instead of MESIF/MOESI?

Because no CPU actually implements MESI. All CPUs implement more complicated stuff, like MESIF / MOESI. Instead of going into MESIF (which only Intel CPUs implement) or MOESI (only older AMD CPUs implement), let's just stick with the textbook MESI.

Which is "wrong", but it's "correct enough" to explain the concept. That's what I mean by an abstraction: no CPU today actually does MESI, it's simply a concept to introduce to solidify the student's understanding of cache coherency. It's close enough to reality without getting into the tricky CPU-specific details of the real world.

> guess we're just handwaving

I mean, you have to set those bits back to 0 before using them as a pointer.

But the system will literally never use those top 16-bits for anything. So some highly optimized code stores data in those top 16-bits and then zeros them out before using. IIRC, Lisp machines and various interpreters.
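A minimal sketch of that trick in C. The function names are hypothetical, and it assumes a 64-bit platform where user-space addresses fit in the low 48 bits with the upper 16 bits zero. It is not how any particular interpreter does it.

```c
#include <stdint.h>

#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Pack a 16-bit tag into the upper bits of a pointer value.
   Assumes the pointer's own upper 16 bits are zero to begin with. */
static uint64_t tag_pointer(const void *p, uint16_t tag)
{
    uint64_t addr = (uint64_t)(uintptr_t)p;
    return (addr & ADDR_MASK) | ((uint64_t)tag << ADDR_BITS);
}

/* Read the tag back out of a tagged value. */
static uint16_t pointer_tag(uint64_t tagged)
{
    return (uint16_t)(tagged >> ADDR_BITS);
}

/* Clear the tag so the value can be used as a pointer again.
   Dereferencing the tagged value directly would fault, because the
   address would no longer be canonical. */
static void *untag_pointer(uint64_t tagged)
{
    return (void *)(uintptr_t)(tagged & ADDR_MASK);
}
```

Typical use is `uint64_t t = tag_pointer(&obj, 0xBEEF);` followed later by `untag_pointer(t)` before every dereference. Code like this breaks if the OS ever hands out addresses wider than 48 bits, which is the risk you accept for the free 16 bits.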

Maybe it's because I'm a physicist and not a software engineer, but it's nice to know how things work all the way down even if in the end, a phenomenological model is all you need to do your work.

It depends. I find that most programmers not knowing some of this is the root of most performance and reliability issues. Take Java for example - pretty sure the Java dev needs to know how to optimize their JVM (memory settings, etc.). They would need to know about direct memory, etc. and how it relates to the operating system. The nastier your traffic profile, the more important the tuning described in that document is. Like all good "books" I don't remember all of it but keep coming back to refer to it.

But again this is speaking from my experience as an SRE/SDE (or a perf/reliability engineer).

I'm sure some programmers don't need to know this. For example an Erlang or Haskell programmer can't do anything about memory layout anyway, so this knowledge would be of no immediate practical use.

> For example an Erlang or Haskell programmer can't do anything about memory layout anyway

No, but we can make algorithms that will do things in other ways than standard. Knowing memory layout and behaviour enables a programmer to invent better ways to do things. Of course those algorithms are very application specific, but at least it increases the solution space. The Erlanger's mantra here is "profile this" so that you know which one is actually better.

Not exactly. There's plenty of ways of looking at GC and controlling allocation and lifetimes in Haskell is definitely possible.

And when necessary the C FFI is useful to have to write small bits of code to take advantage of particular layouts and drive them from code. See [0].

[0] https://www.youtube.com/watch?v=b4bb8EP_pIE

No, it isn't supposed to be ironic.

The title says "should," after all. Who says every programmer should know these things about memory? Obviously not you. Ulrich Drepper does - I bet if you asked him, he would say everyone should know these things, but would concede that almost no programmers do and most programmers don't need to.

I work in frontend after switching gears from embedded systems a few years back. Knowing some level of detail about how computers work is invaluable at all layers of the stack: I can make informed trade-offs between practical performance of code running on an actual computer and the cost of high-level language concerns and features.

Can't agree more. If I went to a software job and started reading that stuff I'd get moved into hardware :)

I suppose that's a credit to all the engineers who build all the middle layers that allow software engineers to float along at an abstract and more productive level.

Every programmer should know something about memory, because software is becoming bloated [1] [2] thanks to a new generation of programmers who don't optimize their code.

So, of course, you had a wonderful career, but it doesn't prove that you write efficient code

[1] http://www.rntz.net/post/against-software-development.html [2] http://tonsky.me/blog/disenchantment/

It really all comes down to what variable one is trying to optimize. I noticed neither article mentioned money as a possible variable to optimize. Here is another take to consider:


"In 1993, given the cost of hard drives in those days, Microsoft Excel 5.0 took up about $36 worth of hard drive space. In 2000, given the cost of hard drives in 2000, Microsoft Excel 2000 takes up about $1.03 in hard drive space. In real terms, it’s almost like Excel is actually getting smaller!"

That's true, but I don't see how it's relevant to this article. We can simply tell people "Write your programs so they use fewer resources, and they will run faster".

Reading a paragraph like this:

> "The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line."

is certainly interesting but it offers no additional insight into how to write an efficient program. Regardless of whether the row and column addresses are sent on the same bus or a different bus, the optimization strategies for programmers are exactly the same. The same goes for almost all of the information here.

Unless you're writing a kernel, 99% of this article is overkill. I mean that literally: it's over 100 pages long, and I think the important and relevant points for most programmers could be summarized in a page or two.

Is it better for my company if the software is delayed to market by 6 months because we were writing the most optimal code possible in low-level languages gaining us 5% more speed on our application?

Most programmers are working for a business whose goal is to make money. Usually that involves adding more features or producing the product faster (as in development time). As far as optimal or efficient goes, all the company cares about is: "Is it fast enough so that people still buy/use the product?"

Premature optimization is wasteful, a waste of both time and money. Unless you're doing it for fun, trying to eke out 1ms on a website load at the cost of excessive dev time and risk of bugs is just silly.

Don't forget, trying to make super optimal code often introduces bugs as you now have code that is more complex and/or uses more low level constructs. I've seen many deadlocks and race conditions due to engineers trying to optimize code unnecessarily.

> Premature optimization is wasteful

More than premature pessimization? Setting your project performance goals is plain good design. It's not premature optimization to suggest that your web server should be able to respond to queries within an average window of time. Working with any kind of service level objective and agreement practically requires you to think about these goals at the design phase regardless.

I've come across teams with engineers who couldn't understand why their application was timing out. They had no idea how to work with data structures and memory layout to optimize access. Their response to business stakeholders was to weakly raise their hands to the ceiling and say, "It can't be fixed." Their customers would not have been happy with that answer; they had precious data and reports to deliver.

It's worth knowing, at least on a conceptual level, how memory architectures and data structures affect performance. Otherwise you end up paying people like me a lot of money. And you don't want that.

> all the company cares about is: "Is it fast enough so that people still buy/use the product?"

Studies have shown that is not the case.

Amazon found that every 100ms of latency cost them 1% in sales. Google found that 400 milliseconds means a nearly 0.5% decrease in search sessions.

So yes, a 5% speedup in an application could be an enormous win for a company.
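
To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes the quoted Amazon figure extrapolates linearly, which is a simplification, and the 2-second page load and $50M yearly revenue are made-up values for a hypothetical shop. `sales_change` is just an illustrative helper name.

```python
# Back-of-the-envelope only: the linear "1% of sales per 100 ms" model and
# every number below are illustrative assumptions, not measured data.

SALES_LOSS_PER_100_MS = 0.01  # the Amazon figure quoted above


def sales_change(baseline_latency_ms: float, speedup_fraction: float) -> float:
    """Estimated fractional sales gain from making a page `speedup_fraction` faster."""
    saved_ms = baseline_latency_ms * speedup_fraction
    return (saved_ms / 100.0) * SALES_LOSS_PER_100_MS


# Hypothetical shop: 2-second page load, $50M yearly revenue.
gain = sales_change(baseline_latency_ms=2000, speedup_fraction=0.05)
extra_revenue = 50_000_000 * gain
print(f"{gain:.1%} more sales, roughly ${extra_revenue:,.0f} per year")
```

Under those assumptions a 5% speedup saves about 100 ms per page load, which is where the "enormous win" comes from. Whether the relationship stays linear for your product is something only your own measurements can tell you.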

I think the argument is that most programmers don't need to write efficient code.

It’s fair game for a python/java/c# interview though! /s

There are some salty comments here, but I think the context is important. This paper passed across my desk in early 2008 when I was doing HFT stuff. It might be a bit of a stretch to say that the reason people are taught about cache lines in most CS programs is because of this paper, but at the time this paper was written, this was really specialized knowledge and groundbreaking to most software developers. This would go on to be a popular topic on C++ blogs from Important People (Boost maintainers, STL devs, etc) at least for the next 5 years.

Also, if you know Ulrich Drepper at all, either from some of his talks or his mailing list presence, this is just a very fitting title from him. Just pure deadpan: you think it's funny, he probably does not, and the fact that you think it's amusing is just disappointing him, like a professor looking out at freshman undergrads wondering how he got stuck teaching this class.

I really do agree with you. Having done HFT for some years, this paper was crucial back when HFT ran on Linux systems. Now FPGAs have taken over the field. A different kind of tech.

He mentions his reason for the title:

> The title of this paper is an homage to David Goldberg’s classic paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic”

I wish Ulrich Drepper (thank you, Mr. Drepper) would update this with a section on Row Hammer and also Spectre and Meltdown. Programmers need to know about memory because of these exploits, more so with the latter two, in order to avoid creating exploitable gadgets.

But then I also think that What Every Computer Scientist Should Know About Floating-Point Arithmetic should be updated to include UNUMs. I don't think that will happen either. Also, thank you Mr. Goldberg.

What Every Computer Scientist Should Know About Floating-Point Arithmetic is great because it makes people aware of problems with the current system. Adding UNUMs would be suggesting the devil you don't know over the devil you know.

However, and I can't remember if it is already in the book, a section on compensated summation / dot product would give strong, time-tested tools to attack the problem from within the current framework.
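
For anyone who hasn't seen compensated summation, here is a minimal sketch of Kahan's algorithm in Python. The function names and the test data are my own choices for illustration. The naive loop is written out by hand on purpose, since the built-in `sum()` in recent Python versions already applies a compensation of its own to floats.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation; rounding errors pile up."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Kahan compensated summation: carry the rounding error of each add."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation             # re-inject the error from the last step
        t = total + y                    # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was just lost
        total = t
    return total


# One large value followed by many values too small to change it on their own.
values = [1.0] + [1e-16] * 10_000
reference = math.fsum(values)  # correctly rounded sum from the standard library
print("naive error:", abs(naive_sum(values) - reference))
print("kahan error:", abs(kahan_sum(values) - reference))
```

With this input the naive loop never moves off 1.0, because each individual addend is below half an ulp of the running total, while the compensated version keeps track of the lost bits and lands within rounding error of the reference.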

Row Hammer, Spectre, and Meltdown are hardware design flaws which don’t necessarily stay meaningful over time.

Further, they are completely irrelevant for many developers. The next Mars rover for example is not going to be running untrusted code.

Thanks for bringing that up, I came to the comments section to see if some other details were outdated. Any other insights?

Well, I think hardware and software prefetch (speculation is a special form of prefetch) might be revisited. Software prefetch was at best a hopeful technology even in 2007 and is widely avoided today.


Something I don't think either Drepper or Hennessy + Patterson's books get across is memory banks from a programmer's perspective. How cache organization affects a program is explained well but how banks affect said same isn't. Construction yes. Visibility, no.

What every programmer should know about memory shouldn't be 114 pages long.

"A bit more than what most programmers need to know about memory, but would be nice if they read anyways"?

A _lot_ more. I don't see how most programmers would benefit from reading 100+ pages of low-level details about memory.

Because knowing your craft is important? Besides, if you don't find this interesting, why become a programmer in the first place?

There's a lot to learn about this craft, and people have to prioritize - knowing algorithms & data structures is more immediately useful compared to, say, knowing what scratchpad memory is. If I spent my time learning every detail about every system underpinning every abstraction, I would literally be 70 years old by the time I started writing code.

> Besides, if you don't find this interesting, why become a programmer in the first place?

Who is saying it's not interesting? We're arguing that it's not fundamentally vital knowledge to know the difference in RAS & CAS latency for SDRAM for most programmers.

But learning algorithms and data structures literally requires you to know about memory on a pretty low level. As I'm sure you know, a lot of algorithms and data structures that are theoretically equal can be vastly different in practice in no small part because of how they use memory.
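
A toy illustration of that point, sketched in Python: both traversals below visit every element of the same matrix exactly once, so they are identical on paper, yet they touch cache lines in very different patterns. The 8-byte element and 64-byte line sizes are assumed typical values, the function names are mine, and the model ignores prefetchers, associativity and multiple cache levels.

```python
# Toy model only: real caches also have prefetchers, associativity and
# multiple levels. ELEM_SIZE and LINE_SIZE are assumed, typical values.

ELEM_SIZE = 8    # bytes per element (e.g. a double)
LINE_SIZE = 64   # bytes per cache line


def row_major_order(rows, cols):
    """Flat indices visited when the inner loop walks along a row."""
    return [r * cols + c for r in range(rows) for c in range(cols)]


def column_major_order(rows, cols):
    """Flat indices visited when the inner loop walks down a column."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


def line_changes(indices):
    """Count accesses that land on a different cache line than the previous one."""
    changes = 0
    previous_line = None
    for i in indices:
        line = (i * ELEM_SIZE) // LINE_SIZE
        if line != previous_line:
            changes += 1
            previous_line = line
    return changes


rows = cols = 64
print("row-major   :", line_changes(row_major_order(rows, cols)))
print("column-major:", line_changes(column_major_order(rows, cols)))
```

In this model the row-major walk moves to a new line once every eight elements, while the column-major walk moves to a new line on every single access. Same big-O, very different memory behaviour.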

If you are objecting to "every," then sure, not every programmer needs to know anything about memory. You can program without knowing that there is such a thing.

But it's a fun title, and an excellent resource.

No, he's right. It's more than simply semantics. A better title would be "an overwhelming majority of programmers don't need to know THIS much about memory"

Or perhaps some more positive phrasings:

- If you know this much about memory, you'll know more than 99% of programmers.

- What 99% of programmers don't know about memory.

The second one reads like a Buzzfeed article...

> What 99% of programmers don't know about memory... number 17 will shock you!

This is what people say when they have a dataset of 100 entries (<100 MB) on their development laptop. How funny they look when the production guys show them how their lame piece of code doesn't scale up to billions of entries on the production platform... But they don't care, because they are developers (or so-called). Until the production guy rewrites their code and has them fired for incompetence (true story).

> How funny they look when the production guys show them how their lame piece of code doesn't scale up to billions of entries on the production platform... But they don't care, because they are developers (or so-called)

Why did the production guy even need them in the first place? Why were they hired?

> (true story).

And that production guy's name was Albert Einstein.

I was so excited to dive into this, but ended up with the same takeaway as most other commenters. Aside: As a data scientist, I’ve been surprised how much I’ve needed to learn about the finer points of optimizing GPU utilization for training.

It has all been from more experienced coworkers, and I would much appreciate any resources anybody could point me to (free or paid) so that I could round out my knowledge.

Learn enough about GPUs to be able to read the profiler. That should be your #1 goal: learning to use the profiler and performance counters.

The profiler not only tells you how fast your code is, but also why your code is fast or slow... at least to the best ability of the hardware performance counters.

Is it RAM-bottlenecked? Is it Compute bound? Are your Warps highly utilized? Etc. etc. If you don't know what the profiler is saying, then study some more.
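
Before you open the profiler, a rough roofline-style estimate can tell you which answer to expect. The sketch below is Python, and the two hardware numbers are hypothetical placeholders rather than the specs of any real GPU; `classify` is an illustrative helper, not part of any profiler's API.

```python
# Roofline-style estimate. The hardware numbers are hypothetical placeholders;
# substitute the figures for your own GPU.

PEAK_FLOPS = 10e12       # assumed peak compute: 10 TFLOP/s
PEAK_BANDWIDTH = 500e9   # assumed peak memory bandwidth: 500 GB/s


def classify(flops, bytes_moved):
    """Return (arithmetic intensity, attainable FLOP/s, limiting resource)."""
    intensity = flops / bytes_moved  # FLOP per byte of memory traffic
    attainable = min(PEAK_FLOPS, PEAK_BANDWIDTH * intensity)
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity where the two limits meet
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, attainable, bound


n = 1_000_000
# Vector add c = a + b on float32: 1 FLOP and 12 bytes of traffic per element.
print(classify(flops=n, bytes_moved=12 * n))

m = 4096
# Dense float32 matmul, m x m: 2*m^3 FLOPs; at best each matrix moves once.
print(classify(flops=2 * m**3, bytes_moved=3 * 4 * m * m))
```

If the estimate says memory-bound and the profiler agrees, more compute tricks won't help and you should look at data movement instead. If the two disagree, that gap is usually the interesting thing to study.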


Interesting that in 2007 he thought FB-DRAM was going to win. That seems to have been about the time it dropped dead.

Right, also about NUMA:

> It is expected that, from late 2008 on, every SMP machine will use NUMA.

Outside servers, that still hasn't happened.

They are not that exotic anymore and are no longer exclusive to very expensive servers, e.g. Threadripper 2920X is a $650 CPU, but market penetration is still low.

For an accessible talk about the real-world implications of this, I enjoy watching Mike Acton's CppCon talk "Data-Oriented Design and C++": https://www.youtube.com/watch?v=rX0ItVEVjHc

I routinely re-watch this talk. It always gets me back in the right mindset.

Ulrich Drepper used to be the glibc maintainer, IIRC

