Effects of CPU Caches (medium.com)
99 points by chewxy 3 months ago | hide | past | web | favorite | 23 comments

There are a couple of issues here and there, but I'm just glad another software guy is looking into hardware instead of just assuming the compiler can figure everything out automatically.

I took a Go course (Ultimate Go from William Kennedy), and he basically says he's not going to teach you the language: it's simple enough if you've got a bit of programming experience. Instead, the first day was mostly about the nearly 1:1 relationship between code and hardware, shown rather than told, e.g. traversing a 2D array row-wise versus column-wise and measuring the difference (memory layout is row-major), and also a lot about CPU caches, although there he generously showed us footage from a great presentation by an Intel guy.
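The row-vs-column demonstration mentioned above is easy to reproduce; here's a minimal Go sketch (the names and sizes are my own, and absolute timings will vary by machine). Go arrays are laid out row-major, so the row-wise walk touches memory sequentially while the column-wise walk jumps a full row between accesses:

```go
package main

import (
	"fmt"
	"time"
)

const n = 2048 // 2048x2048 int64s = 32 MiB, well beyond the caches

// sumRowMajor walks the grid in memory order: consecutive elements
// share cache lines, and the prefetcher sees a linear stream.
func sumRowMajor(g *[n][n]int64) int64 {
	var s int64
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			s += g[i][j]
		}
	}
	return s
}

// sumColMajor walks column by column: each access jumps n*8 bytes,
// so nearly every access touches a new cache line.
func sumColMajor(g *[n][n]int64) int64 {
	var s int64
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			s += g[i][j]
		}
	}
	return s
}

func main() {
	g := new([n][n]int64)
	start := time.Now()
	sumRowMajor(g)
	fmt.Println("row-major:   ", time.Since(start))
	start = time.Now()
	sumColMajor(g)
	fmt.Println("column-major:", time.Since(start))
}
```

On most hardware the column-major walk comes out several times slower once the array outgrows the caches, purely from cache-line and prefetcher behavior; both functions do the exact same arithmetic.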

Any chance the Intel guy presentation is public and you can share? I'd like to take a look myself.

Thanks!! :) I'd be very glad if you could point out the issues you found!

Few people realize this, but completely random RAM access can have lower throughput than e.g. large linear reads from SSD. That’s one of the reasons why I don’t trust languages which give me little to no control over data locality.

Pages from disk are resident in memory once retrieved, so linear reads from ANY disk (not just an SSD) may be faster than random RAM access at some point (the curves intersect once you've paid the initial seek cost). If you're smart about queuing additional page fetches while traversing existing pages, you can accelerate this even more.

Not even when pages are resident in memory. I mean after it gets going, ignoring the initial latency of access, which is indeed staggering by CPU standards. To dramatically slow down memory access, it's enough to just miss the cache every time. Here's another thing most people don't know: a cache miss costs at least 100-200 cycles of delay to fetch from main RAM. People don't quite get how long that is. A good demonstration is to count aloud to 200, and then realize that on an L1 hit you'd be doing several operations simultaneously per single count. As CPU frequencies increase, this number trends upward. All of a sudden your ~26GB/s of memory bandwidth turns into a pumpkin.
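The throughput collapse from missing cache on every access is easy to show with pointer chasing. A hedged sketch (my own construction, not from the article): a random permutation cycle forces a dependent load per step, so each access must wait out a full miss, while a linear sweep over the same array lets the prefetcher hide nearly all the latency:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// makeCycle builds next such that following next[idx] from index 0
// visits every element exactly once, in random order (a single cycle).
func makeCycle(n int) []int64 {
	next := make([]int64, n)
	perm := rand.Perm(n)
	for i := 0; i < n-1; i++ {
		next[perm[i]] = int64(perm[i+1])
	}
	next[perm[n-1]] = int64(perm[0])
	return next
}

func main() {
	const n = 1 << 22 // ~32 MiB of int64s: well beyond L3 on most CPUs

	next := makeCycle(n)

	// Dependent loads: each iteration must wait for the previous
	// cache miss to resolve before it knows where to go next.
	start := time.Now()
	var idx int64
	for i := 0; i < n; i++ {
		idx = next[idx]
	}
	fmt.Println("random chase:", time.Since(start), idx)

	// Same data, linear order: the prefetcher streams it in.
	start = time.Now()
	var sum int64
	for i := 0; i < n; i++ {
		sum += next[i]
	}
	fmt.Println("linear sweep:", time.Since(start), sum)
}
```

The chase touches exactly as many elements as the sweep, yet typically runs an order of magnitude (or more) slower, which is the "bandwidth turns into a pumpkin" effect.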

Nice. I was thinking about this lecture too.

> Because the point of this experimentation was to iterate over a list by accessing the memory and so to measure the time for each iteration, I had to be sure every executed operation at each iteration was the most stripped-down set of instructions. Hard to be shorter than one cpu instruction for the given loop:

I’ll wager that the author didn’t look at the generated assembly: if that loop really had only one CPU instruction, that instruction would be an increment, making the analysis of cache effects moot.

Hi, thanks for reading. I'm sorry if this statement was not clear. I meant, of course, the instructions beyond those necessary for the loop itself.

Most variables live in registers, so they have no effect on the cache, since no memory is used to store the data. Only if a variable is spilled onto the stack (or escapes to the heap) does it consume cache. However, the stack is much smaller than the data structures in play, so only a small percentage of the cache is used for this control data.

In a handrolled assembly loop, the array pointer can be the iterator itself. In fact, this would be the ideal baseline to compare the linked list version to, as the linked list would not have any kind of incremented iterator other than the pointer itself.

Yes, you're right! Since I was using the benchmark tools of go test, I had to use this increment. The main goal was to compare exactly the same operations across benchmarks.

However, in the last test I ran, where I walked indefinitely over the linked list, I guess I was closer to the behavior you're pointing out.
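For what it's worth, here's roughly how such a comparison can be set up with the go test benchmark machinery, driven from testing.Benchmark so it runs as a plain program. The node layout and sizes are illustrative, not the article's exact code: nodes linked in memory order reward the prefetcher, nodes linked in shuffled order defeat it.

```go
package main

import (
	"fmt"
	"math/rand"
	"testing"
)

type node struct {
	next *node
	pad  [56]byte // pad to 64 bytes so each node fills one cache line
}

// buildList links the nodes of a backing array in the given visit order
// and returns the head.
func buildList(order []int) *node {
	nodes := make([]node, len(order))
	for i := 0; i < len(order)-1; i++ {
		nodes[order[i]].next = &nodes[order[i+1]]
	}
	return &nodes[order[0]]
}

// walk traverses the list and returns the node count, mirroring the
// per-iteration counter a go test benchmark loop imposes.
func walk(head *node) int {
	count := 0
	for n := head; n != nil; n = n.next {
		count++
	}
	return count
}

func main() {
	const n = 1 << 20
	seq := make([]int, n)
	for i := range seq {
		seq[i] = i
	}
	lists := map[string]*node{
		"sequential": buildList(seq),          // nodes adjacent in memory
		"shuffled":   buildList(rand.Perm(n)), // random order defeats the prefetcher
	}
	for name, head := range lists {
		h := head
		r := testing.Benchmark(func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				walk(h)
			}
		})
		fmt.Println(name, r)
	}
}
```

Both benchmarks execute identical instructions per node; only the memory layout differs, which isolates the cache effect.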

One of the effects first noticed in CS papers maybe 20 years ago is that binary heaps should NOT start at the beginning of a cache line, which is what most programmers would assume to be the wise thing to do; we're always told to align to cache boundaries. If you don't move the root node to the end of a cache line (or at least toward the middle, depending on line size and node size), traversing the heap will touch more lines, since the sibling nodes you need to compare will often or always sit in separate cache lines.

Hi! I think I've misunderstood your comment, so correct me if I'm wrong.

For instance, if each item has a size of 64B (for N=7), which is the size of a cache line, then whatever the position of the first element within the first cache line, the next item will necessarily be located in the next cache line. Did I miss something?

In a binary heap, there's a single root, but after that, there are pairs of nodes. In your case, the item is the same size as a cache line, so things are not too bad: during traversal, you'll always read two whole lines for each pair of nodes. You could still worry about the CPU cache's n-way associativity, but that either requires a lot of control over memory allocation layout or provides diminishing returns.

Anyway, the interesting case is when items are smaller than a line. Look at an array-based implementation of a heap:


Root is item 0. Assume it's aligned to a multiple of 64 bytes and you have two 32-byte items per line. When comparing array items 1&2, 3&4, 5&6, etc., notice how the two items are always in different cache lines. Item 1 is in the same line as item 0, not item 2: you're always crossing a line boundary. If the root were at item 1 in the same array (or, equivalently, at item 0 in an array starting at a multiple of 64 plus an offset of 32), then each pair would always be in the same cache line.
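That pairing argument is just index arithmetic: with the root at index 0, the siblings compared during traversal are (2i+1, 2i+2), and with 32-byte items on 64-byte lines, whether a pair shares a line depends only on the root's byte offset. A small sketch to check it (my own helper, hypothetical names):

```go
package main

import "fmt"

const (
	lineSize = 64 // bytes per cache line
	itemSize = 32 // two heap items per line
)

// line reports which cache line item i occupies, given the byte
// offset of item 0 from a 64-byte boundary.
func line(i, baseOffset int) int {
	return (baseOffset + i*itemSize) / lineSize
}

func main() {
	for _, base := range []int{0, itemSize} {
		fmt.Printf("root at byte offset %d:\n", base)
		// Sibling pairs compared during traversal: (1,2), (3,4), (5,6), ...
		for i := 1; i <= 5; i += 2 {
			same := line(i, base) == line(i+1, base)
			fmt.Printf("  items %d,%d share a line: %v\n", i, i+1, same)
		}
	}
}
```

With the root at offset 0 every sibling pair straddles a line boundary; with the root shifted by half a line, every pair lands in a single line, halving the lines touched per comparison.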

In case it wasn't clear, I was pointing out a similar phenomenon with other data structures; it wasn't about linked lists.

Ok yes!! This is perfectly clear now, thanks. I didn't know that; very interesting, and a smart optimisation of cache-line use!

I've also read that processors are able to request the words of a cache line in different orders, fetching hot words first. That may help a little when this kind of structure is not aligned in an optimized fashion. It doesn't work for HW prefetch, though.

This dissertation goes into great detail on this topic: http://people.mpi-inf.mpg.de/~tojot/papers/TowardBetterCompu...

> The hardware prefetch is not really efficient when N>7, because each list element is spread over several cache lines, plus the prefetching is not able to cross page boundaries.

Could you explain the first part of the sentence (about the list elements spreading over several cache lines)? It's true that the elements do not fit into a cache line for N>7, but is that important for the prefetcher? Shouldn't the regular access pattern to "en.n" be detected by the streamer prefetcher (without shuffling, of course)?

Hi, thanks for your question.

You're right; this is exactly why, a few paragraphs later, I express doubt about that statement, saying that the prefetcher is certainly not reading the entire array and may behave exactly as for N=7. I think this is more about the fact that the prefetcher can't cross page boundaries.

That's still right for N<7, though. I still have to dig into this part :)

Thanks again!

This page http://hexus.net/tech/reviews/cpu/37989-intel-core-i7-3770k-... claims that i7 CPUs have a "next-page prefetcher". I don't know how different i5 and i7 are. Do you think your i5 CPU doesn't have it?

Sorry for the delay; I took some time to do additional research.

What I found is that my i5 has the code name "Products formerly Haswell". I don't know for sure whether it has the "next-page prefetcher", but I would say yes.

Looking at the document that a reader on Medium linked to me https://software.intel.com/sites/default/files/managed/9e/bc...

it looks like the NPP was introduced with the Ivy Bridge family (paragraph 2.4.7), and paragraph 2.3 says: "The Haswell microarchitecture builds on the successes of the Sandy Bridge and Ivy Bridge microarchitectures.".

That said, how the NPP works is not really clear. I'd say it's maybe more about prefetching the next page to avoid TLB misses. In this document I didn't find any statement about crossing a page boundary during data prefetches.
