The Elements of Cache Programming Style (2000) (usenix.org)
51 points by rramadass on Feb 16, 2022 | hide | past | favorite | 17 comments



I suppose I should say ask me anything.

This was written at a time when the Linux scheduler was a simple process loop, so Linux would slow to a crawl when the process count hit around 64. I figured out the underlying cause but didn't submit a patch. Then I really started looking at the problem, and the result of that investigation was this article, which was/is a bit bombastic but a good tutorial on cache issues nonetheless.

Weird that it's on the front page lo these many years later.


I posted this because I saw Drepper's paper on memory ("What Every Programmer Should Know About Memory") make its appearance again, and I had always thought this short write-up (even though aimed at kernel programmers) was/is far more relevant.

Maybe update the article for current times and split it into two parts? :-) One explaining general cache principles and how an "Application Programmer" can take advantage of them, and the other on how the Linux kernel does it; Theory and Practice!

PS: The Schimmel book (which you also referenced) is the only place where I have seen all the cache issues explained comprehensively and clearly.


Thanks, any comparison to Drepper is a distinction.

I'd like a tool, maybe part of Cachegrind, which said: dude, this instruction is thrashing this L2 set. A profiling virtual machine (Valgrind) is kind of the ultimate lookaside for doing this. It could easily (if expensively) evaluate these kinds of queries at each load/store. I think it's better to be able to find these problems than to understand them from first principles.

I think something like Computer Organization And Design could have a section on cache programming and maybe a problem set. But if I was to revisit this, I'd write something like A Bestiary of Caches which would delineate all of the different caches in a modern CPU. Things like branch prediction target caches, the micro-op cache, the LSD, ... but all in one place.


>But if I was to revisit this, I'd write something like A Bestiary of Caches which would delineate all of the different caches in a modern CPU. Things like branch prediction target caches, the micro-op cache, the LSD, ... but all in one place.

You said it; now you gotta do it :-)

PS: Jokes aside, it is a sincere request. Most folks don't know about these issues, and if they do, it is piecemeal; bringing it all together in one place, by somebody who understands it all, would be invaluable.


Has anyone else noticed how data-oriented design has become more common in the past few years? At least that's my perception.

For anyone interested in similar patterns for performance or good memory-access practices, I highly recommend this recent talk by Andrew Kelley on data-oriented design. It's one of the best (if not the best) talks on the topic.

https://media.handmade-seattle.com/practical-data-oriented-d...


I don't know who started the trend. I personally first heard the term from Mike Acton's CppCon 2014 talk.

But as pointed out by another comment, I don't think it's that mainstream outside of HPC and game development. At least that's what comes up when you google relevant terms like cache, locality, etc.


Thanks, that's an interesting talk, and I enjoyed the examples; they're very clear.


It appears this web page hasn’t been cached.


Does anyone have any similar articles targeting arm64 (M1)?


I think the issues are the same for arm64/M1. The cache structures are probably even similar. However, modern CPUs are out-of-order, superscalar, and dual-ported.

You might look at something like the Arm® Cortex®-A75 Software Optimization Guide for a good picture of the microarchitecture. Unfortunately, Apple doesn't put out that kind of information for the M1. The LLVM docs have a little information, but the structure isn't that different from an A75, although the latencies are.


Sincerely caring about memory coherency is a totally lost art. Hardware gets faster, programmers add more abstractions to make their lives easier (read: ship more product, make companies more money) and the consumer is left kicking rocks with bug-riddled code 100x less performant than it could be.


It is astonishing to me that the most basic of memory hacks are regarded as dark magics these days. Many things take zero extra effort to make orders of magnitude faster if you are simply aware of some basic concepts.

For instance, if you know a collection of things is going to be bounded by a reasonable quantity (e.g. fewer than 1MM) and that you will frequently need to perform a full linear scan of these things, you should consider using an array rather than some B-tree or other pointer-chasing type (List/Dictionary/etc.). This is literally a simple "Should I use List<T> or T[MaxElements]?" determination. The effects of this are very profound if many scans of these items need to be made per unit time (i.e. a game loop).

You can take this one step further by making the type used in the collection a value type rather than a reference type. Depending on language/runtime/problem space, this can give you another order-of-magnitude uplift. And it's pretty much as simple as "public [readonly] struct" vs "public class".
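The same effect shows up in any language, not just C#. Here's a minimal C++ sketch of my own (type and field names made up): the value layout is one contiguous block the prefetcher can stream through, while the reference layout pays a dependent cache miss per element to wherever the allocator put it.

```cpp
#include <memory>
#include <vector>

// Hypothetical element type for illustration.
struct Particle { float x, y, vx, vy; };

// Value layout: elements stored inline and contiguous; a scan is a
// predictable sequential walk that hardware prefetchers handle well.
float sum_values(const std::vector<Particle>& ps) {
    float s = 0;
    for (const Particle& p : ps) s += p.x;
    return s;
}

// Reference layout: each element is a separate heap allocation, so every
// iteration chases a pointer to a potentially cold, scattered cache line.
float sum_refs(const std::vector<std::unique_ptr<Particle>>& ps) {
    float s = 0;
    for (const auto& p : ps) s += p->x;
    return s;
}
```

Both functions compute the same thing; the difference is purely where the bytes live, which is exactly the "struct vs class" choice the comment describes.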


I literally just submitted a ticket to an Azure SDK because every API call was taking on the order of 100s of milliseconds.

This is because they were sequentially trying things in a loop and using exceptions for flow control.

If you use an APM (application performance monitoring) tool, the overhead of this is monstrous.

This particular function is called in every authenticated Azure call by default. Using any such API turns the debug output into a wall of spurious errors scrolling past like the Matrix.

To top it off, they use asynchronous code in a sequential style, so the latency is strictly worse than naive synchronous code. This is for a bunch of back-to-back HTTP API calls.

They closed the ticket with a “this is what the style guide says to do” and now millions of customers have to just live with this.

If you ever wondered why nothing takes less than a second in Azure, it’s just a handful of reasons like this.

An easy fix to reduce API response times from seconds to milliseconds… rejected.
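For anyone who hasn't run into this anti-pattern, here's a minimal C++ sketch of my own (not the actual Azure SDK code) contrasting exceptions-as-flow-control with an ordinary status return:

```cpp
#include <charconv>
#include <stdexcept>
#include <string>

// Throwing style: an expected, common failure ("not a number") is signaled
// by an exception. On a hot path this costs stack unwinding every time, and
// each throw surfaces in debuggers/APM tools as a spurious error.
int parse_throwing(const std::string& s) {
    size_t pos = 0;
    int v = std::stoi(s, &pos);                 // throws on non-numeric input
    if (pos != s.size()) throw std::invalid_argument("trailing junk");
    return v;
}

// Checked style: the same information comes back as a return value, so the
// failure path is an ordinary branch with no unwinding and no log spam.
bool parse_checked(const std::string& s, int& out) {
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), out);
    return ec == std::errc() && ptr == s.data() + s.size();
}
```

Probing N alternatives in a loop with the throwing style means every eventual success first burns through a string of thrown-and-caught exceptions, which is exactly the "wall of spurious errors" effect described above.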


Alright, so here's something stupid I've always wanted to ask: Is there some kind of library/language/paradigm where instead of declaring what data structures you want to use, you declare constraints and expectations and it guesses the "best" one? So even if you're uncertain, you can say "I expect this collection to have about n elements", "this should be sorted", "I expect it to grow in a certain way", and so on (of course you might want to express much more advanced patterns).


Some upcoming languages like Zig and Jai have features to allow you to easily switch from array-of-structures to structure-of-arrays at writing time (https://www.youtube.com/watch?v=zgoqZtu15kI). Some edge cases also require rather unique ways of chopping up data that wouldn't lend themselves well to being shoehorned into one of the canned categories of transforms.

Although constraint declaration is orthogonal to data-oriented development, I'd say the philosophy of using generic constraint solvers (or other overly generalized CS ideas) goes against the spirit of DOD, which is to "just simply do the work". It is, after all, a reaction against object orientation (and even modern FP), which tended to think too much in the abstract and worry too much about some form of taxonomy, as opposed to coding against plain data.
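For concreteness, here's what that AoS-to-SoA switch looks like when done by hand in C++ (my own sketch; type and field names are made up):

```cpp
#include <vector>

// Array of structures: fields interleaved in memory, so a pass that only
// reads `x` still drags every `y` and `hp` through the cache alongside it.
struct EntityAoS { float x, y; int hp; };

// Structure of arrays: each field is its own contiguous array, so a pass
// over `x` touches nothing but tightly packed `x` values.
struct EntitiesSoA {
    std::vector<float> x, y;
    std::vector<int> hp;
};

float sum_x_aos(const std::vector<EntityAoS>& es) {
    float s = 0;
    for (const EntityAoS& e : es) s += e.x;  // 12-byte step, 1/3 of it useful
    return s;
}

float sum_x_soa(const EntitiesSoA& e) {
    float s = 0;
    for (float v : e.x) s += v;              // 4-byte step, all of it useful
    return s;
}
```

Languages like Zig (`std.MultiArrayList`) and Jai let you flip between these two layouts without rewriting every access site by hand, which is what makes experimenting with the transform cheap.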


> Some upcoming languages like Zig and Jai have features to allow you to easily switch from array-of-structures to structure-of-arrays at writing time.

no need to reach for upcoming languages, it's doable directly in C++: https://github.com/celtera/ahsohtoa =)


Sounds like you want SQL.



