I actually don't think this is that profound; when OP is testing datasets that fit within cache, what's happening is that we are simply not waiting for data to be loaded in from RAM. Let's look at this from a different angle; instead of looking at GB/s, let's think about the CPU as a machine that executes instructions as fast as it can, then look at what can go wrong.

I could write a program in assembly that is simply 1,000,000 triplets of (load, add, store) instructions, each reading from a sequentially increasing memory location. We could think of it as a fully unrolled addition loop. My CPU, running at 3 GHz, should supposedly complete this program in ~1 ms (3 million instructions at 3 billion instructions per second), but (spoiler alert) it doesn't. Why?
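Here's that thought experiment as a compilable C sketch for anyone who wants to try it at home. I've left it as a loop rather than literally emitting a million triplets, and a compiler at higher optimization levels will likely vectorize it, so treat it as an approximation of the unrolled program; the 3 GHz in the cycles-per-element estimate is just the nominal clock from above.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000
    #define CPU_HZ 3e9   /* nominal 3 GHz clock, as in the discussion above */

    int main(void)
    {
        uint32_t *a = malloc(N * sizeof *a);
        uint32_t *b = malloc(N * sizeof *b);
        /* touch both arrays up front so the timed loop isn't paying for page faults */
        for (size_t i = 0; i < N; i++) { a[i] = (uint32_t)i; b[i] = 0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* the 1M (load, add, store) triplets, written as a loop
           instead of being fully unrolled */
        for (size_t i = 0; i < N; i++)
            b[i] = a[i] + 1;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.3f ms, ~%.2f cycles per element at 3 GHz\n",
               sec * 1e3, sec * CPU_HZ / N);

        free(a);
        free(b);
        return 0;
    }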

The answer is that the CPU executes instructions as quickly as it can, but it spends an awful lot of time waiting around for various subsystems to finish doing something. Unless you are running highly optimized code, chances are your CPU, even though "utilization" is pegged at 100%, has more circuits sitting around doing nothing than it has circuits doing something. One of the largest contributors to your CPU impatiently tapping its foot is the memory subsystem, because fetches from RAM are so slow, and every instruction that depends on a `load` (such as the `add` and the `store` here) is completely blocked until that load finishes.

To help with this, we have the whole cache hierarchy, with prefetching of whole cache lines and whatnot, to try and grab pieces of memory that we think will be used next into cache, so that we can then access them with _much_ lower latency, and therefore higher bandwidth: since the only thing preventing us from processing more data is waiting around for more data to come in, latency is directly the inverse of bandwidth in this case.
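To put rough numbers on that latency-vs-bandwidth point (ballpark figures of mine, not measurements from the article): if every 8-byte load had to wait out a full ~100 ns round trip to RAM before the next one could even start, you'd be limited to about 8 bytes / 100 ns ≈ 0.08 GB/s, whereas the same fully serialized loads hitting L1 at ~1 ns each would run at about 8 GB/s. Everything the memory subsystem does (cache lines, prefetch, multiple misses in flight) is about hiding or overlapping that latency so you land closer to the second number.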

Therefore, when doing random accesses into a dataset that won't fit in cache, we expect the average time per retired instruction to be roughly how long it takes to pull a value in from RAM. Sequential access is faster only because when I ask for a value and it isn't already in cache, I grab not just that value but the entire cache line at once, so the next several loads only have to reach out to cache. Smart compilers can place prefetching hints here as well, so that future fetches from RAM overlap with our accesses to cache.

The reason I don't find it profound that random access into cache has the same performance as sequential access is that it falls out directly from the above scenario: sequential access into RAM _is_ random access of cache! The reason sequential access to RAM is fast is that the values are already in cache, thanks to having fetched an entire cache line; accessing those same cache lines in a random order is therefore equivalent (from this perspective).
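A toy benchmark makes this easy to see on your own machine: compare a small, cache-resident array accessed in random order against a big array streamed sequentially. The sizes below are arbitrary choices on my part, and walking the index array adds some memory traffic of its own, so treat the printed GB/s as ballpark only.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now_ms(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec * 1e3 + t.tv_nsec / 1e6;
    }

    /* small xorshift PRNG so the shuffle works for large n */
    static uint64_t rng_state = 88172645463325252ULL;
    static uint64_t xorshift64(void)
    {
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return rng_state;
    }

    /* Visit n elements of a[] in the order given by idx[] and sum them.
       The sequential runs also go through idx, so both variants do the same work. */
    static uint64_t sum_indexed(const uint32_t *a, const uint32_t *idx, size_t n)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[idx[i]];
        return s;
    }

    static void run(const char *label, size_t n, size_t total, int shuffle)
    {
        uint32_t *a   = malloc(n * sizeof *a);
        uint32_t *idx = malloc(n * sizeof *idx);
        for (size_t i = 0; i < n; i++) { a[i] = (uint32_t)i; idx[i] = (uint32_t)i; }
        if (shuffle)                          /* Fisher-Yates shuffle of the index */
            for (size_t i = n - 1; i > 0; i--) {
                size_t j = xorshift64() % (i + 1);
                uint32_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
            }

        size_t reps = total / n;              /* touch the same element count in every run */
        double t0 = now_ms();
        uint64_t s = 0;
        for (size_t r = 0; r < reps; r++)
            s += sum_indexed(a, idx, n);
        double t1 = now_ms();
        double gbps = (double)total * sizeof(uint32_t) / ((t1 - t0) / 1e3) / 1e9;
        printf("%-26s %8.1f ms  %6.2f GB/s  (sum=%llu)\n",
               label, t1 - t0, gbps, (unsigned long long)s);
        free(a); free(idx);
    }

    int main(void)
    {
        size_t small = 64u << 10;             /* 64K uint32_t = 256 KiB: cache-resident   */
        size_t large = 16u << 20;             /* 16M uint32_t =  64 MiB: far beyond cache */
        run("small array, sequential", small, large, 0);
        run("small array, random",     small, large, 1);
        run("large array, sequential", large, large, 0);
        run("large array, random",     large, large, 1);
        return 0;
    }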

For those interested in learning more about the cache hierarchy and why it's important to organize your algorithms/data structures so that you can do a lot of work without your working set falling out of cache, I highly recommend reading the HPC community's tutorials on writing high-performance matrix multiplication. Writing a fast GEMM is one of the simplest exercises that will drive home why we have tricks like tiling, loop unrolling, loop fusion, parallelization, etc. to make maximal use of the hardware available to us.

For those that like academic papers, Goto et al.'s paper (https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf) lays a lot of this out nicely. For those that like follow-along tutorials, this one was fun for me: http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/
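If you just want a quick taste of the tiling idea before diving into those, the first step usually looks something like this. The block size is a placeholder you'd tune so that tiles of A, B, and C fit in cache together; a real GEMM (as in the paper above) goes much further with packing, vector micro-kernels, and so on.

    #include <stddef.h>

    #define BLOCK 64   /* tile edge; tune so three BLOCK x BLOCK tiles stay cache-resident */

    /* C += A * B for n x n row-major matrices, processed in BLOCK-sized tiles
       so each tile is reused many times while it is still hot in cache. */
    void gemm_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t jj = 0; jj < n; jj += BLOCK) {
                    size_t imax = ii + BLOCK < n ? ii + BLOCK : n;
                    size_t kmax = kk + BLOCK < n ? kk + BLOCK : n;
                    size_t jmax = jj + BLOCK < n ? jj + BLOCK : n;
                    for (size_t i = ii; i < imax; i++)
                        for (size_t k = kk; k < kmax; k++) {
                            double aik = A[i * n + k];   /* reused across the whole j tile */
                            for (size_t j = jj; j < jmax; j++)
                                C[i * n + j] += aik * B[k * n + j];
                        }
                }
    }

The i-k-j ordering keeps the innermost loop streaming through contiguous rows of B and C (which is what lets the compiler vectorize it), and the blocking keeps those rows in cache while they're being reused.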




Hi. OP here.

> The reason I don't find it profound that random access into cache has the same performance as sequential access is that it falls out directly from the above scenario

Correct. The behavior can be logically explained. There's no magic involved.

> sequential access into RAM _is_ random access of cache

I actually like your statement even better than my post. Sequential access into RAM _is_ random access of the cache! What a delightfully profound statement.

High-performance computing is your specialty. Of course everything in my post is obvious to you. Elementary even.

If you want to argue the semantics of whether a logical conclusion counts as profound or not... well, I guess?

Not gonna lie. Your comment kind of comes across as a long-winded humblebrag. "Everything OP said is true. I just think it's obvious." Sorry if that wasn't your intent.


Sorry, I don't mean for this to come across as playing down your post; you're right that "profoundness" is completely subjective and there's no use in me saying "it's not that profound". For that I apologize.

I enjoyed your post quite a bit, especially how far you went in order to get concrete results that back up the theory. Thank you for going through the effort of writing this post and educating others on your findings. :)


Thanks, I appreciate that. <3

:)


> I could write a program in assembly that is simply 1,000,000 triplets of (load, add, store) instructions, each reading from a sequentially increasing memory location. We could think of it as a fully unrolled addition loop. My CPU, running at 3 GHz, should supposedly complete this program in ~1 ms (3 million instructions at 3 billion instructions per second), but (spoiler alert) it doesn't. Why?

I'll bite. Why doesn't it? And how long do you expect it to take? I'll claim that with a modern processor a simple loop in C probably beats this speed. If you want, we can test afterward to see if our respective theories are true.

The linked article claims that single-threaded sequential read speed from memory (on his machine) is 11 GB/s. On a 3 GHz system that's a throughput of a little over 3 bytes per cycle (11/3), i.e. every 3 cycles we can pull down 11 bytes, which should be enough to comfortably finish our 1M loads in 1 ms. With 64-bit integers it's getting a little tighter, but it should still be possible.

I guess on a technicality you might be right, but not in a good way. If you were to fully unroll the assembly, you might be able to slow things down enough such that you were running at less than 1 integer per 3 cycles. A simple loop is going to be faster here. Done right (fused SUB/JNZ) the loop overhead should actually be zero. Depending on what you are doing with the store (in place, or to another sequential array?) I'd guess you'd be able to get down to less than 2 cycles per 32-bit int with simple C, and pretty close to 1 cycle per int if you went all out with software prefetching and huge pages.
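For what it's worth, the "going all out" part I have in mind is nothing fancier than an explicit prefetch a few cache lines ahead of the plain loop. A sketch of the in-place variant; the prefetch distance is a guess you'd tune per machine, and the huge-page setup (mmap/madvise) is left out:

    #include <stdint.h>
    #include <stddef.h>

    /* In-place add with a software prefetch a few cache lines ahead of the
       current position. __builtin_prefetch is a GCC/Clang builtin and only a
       hint, so asking for an address a little past the end of the array won't
       fault. 64 uint32_t = 256 bytes = 4 cache lines of lookahead. */
    void add_inplace_prefetch(uint32_t *a, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + 64], 1, 0);   /* rw = 1: we'll write this line */
            a[i] += 1;
        }
    }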


I think I see the problem. OP thinks these sentences say the same thing:

"Random access into the cache has comparable performance to sequential access from RAM."

"Sequential access from RAM has comparable performance to random access into the cache."

Whereas I do not.



