
The constraining factor in this case is CPU throughput. The CPU cannot execute instructions fast enough to keep up with the data being read in from memory.

Of course you are right that this is not the whole picture; it is possible to be bottlenecked on neither memory bandwidth nor CPU throughput.

With respect to memory, two additional considerations are read amplification and latency.

Read amplification: Read amplification basically means that you are reading data that you don't actually use. Suppose we have a table stored in row format with four columns C1, C2, C3, C4, each storing 8-byte integers. Row format means that the values for the columns are interleaved, so the memory layout would look like this, where cn refers to some value for column Cn:

  0x0000 c1 c2 c3 c4 c1 c2 c3 c4
  0x0040 c1 c2 c3 c4 c1 c2 c3 c4
  ...
One property of conventional memory is that we can only read memory in blocks of 64 bytes (the size of a cache line). So if we have a loop that sums all of the values in C1, we will actually also load all of the values for columns C2, C3, C4. This means we have a read amplification factor of 4, and our usable bandwidth will be a quarter of the theoretical maximum.
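
To make that concrete, here is a minimal C sketch of such a loop (the struct and function names are hypothetical, just to illustrate the layout): even though only c1 is used, every 64-byte cache line that gets loaded also carries c2..c4.

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical row-format table: the four column values are interleaved in memory. */
  struct row { int64_t c1, c2, c3, c4; };

  int64_t sum_c1_rowwise(const struct row *rows, size_t n) {
      int64_t sum = 0;
      for (size_t i = 0; i < n; i++)
          sum += rows[i].c1;  /* uses 8 of every 32 bytes touched: 4x read amplification */
      return sum;
  }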

Latency: Suppose instead that we don't have any read amplification and make use of all values in each loaded cache line. Then after we finish processing a cache line and load the next one, we might have to wait a long time (100s of cycles) before the new cache line actually arrives! So we would actually be stalled on memory latency during parts of program execution.

One of the big advantages of using columnar layout (which stores all values for a column sequentially) is that it all but eliminates read amplification. Similarly, by accessing data sequentially we make it possible for the CPU to predict what data will be accessed in the future, and have it loaded into the cache before we even request it.
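
For contrast, a sketch of the same sum over a hypothetical columnar layout; every byte of every cache line that is loaded is a c1 value, and the purely sequential access is easy for the hardware prefetcher to follow:

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical columnar layout: each column lives in its own contiguous array. */
  struct columns { int64_t *c1, *c2, *c3, *c4; };

  int64_t sum_c1_columnar(const struct columns *t, size_t n) {
      int64_t sum = 0;
      for (size_t i = 0; i < n; i++)
          sum += t->c1[i];  /* each 64-byte line delivers eight c1 values, all of them used */
      return sum;
  }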

Of course there are a number of reasons why things might not work out quite so perfectly in practice, and those may be why performance in this benchmark starts to degrade before the (usable) memory bandwidth reaches the theoretical maximum memory bandwidth. But as you can see, it is still possible to get very close, so it's safe to assume that queries which read much less data than the maximum bandwidth are constrained on CPU.




Excellent explanation of the advantages of a columnar layout! I'm still a little doubtful about the exact diagnosis, though.

> The constraining factor in this case is CPU throughput.

When you looked at it with VTune, did you find clear evidence of this? That is, were you consistently executing 4 instructions per cycle, or were you maxed out at 100 percent utilization for a particular execution port? While it's possible to be at these limits (say if you were hashing each input) in most cases I think it's rare to get to that point when reading from RAM.

> Similarly, by accessing data sequentially we make it possible for the CPU to predict what data will be accessed in the future, and have it loaded into the cache before we even request it.

One important caveat to this (at least on Intel) is that the hardware prefetcher does not cross 4KB page boundaries. So if you are doing an 8B read every cycle, and if you are able to do your other processing in the other 3 µops available that cycle, it's taking you ~500 cycles to process each page. And then you hit a ~200 cycle penalty when you finish each page, which means you are taking a ~25% performance hit because of memory latency, and probably have an IPC closer to 3 than 4! So if you (or your compiler) haven't already done so, you might see a useful performance boost by adding in explicit prefetch statements (one per cache line) for the next page ahead.
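
A rough sketch of what that might look like with GCC/Clang's __builtin_prefetch (the 4 KiB lookahead and the once-per-cache-line check are just the "next page ahead" idea described above, not tuned values):

  #include <stdint.h>
  #include <stddef.h>

  #define CACHE_LINE 64
  #define PAGE_SIZE  4096

  int64_t sum_with_prefetch(const int64_t *data, size_t n) {
      int64_t sum = 0;
      for (size_t i = 0; i < n; i++) {
          /* once per cache line, request the matching line one page ahead;
             on x86 a prefetch past the end of the array is harmless */
          if (i % (CACHE_LINE / sizeof *data) == 0)
              __builtin_prefetch((const char *)data + i * sizeof *data + PAGE_SIZE);
          sum += data[i];
      }
      return sum;
  }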

Anyway, thanks for your response. I like your approach, and I wish you luck in making it even faster!


Yes, excellent points. No doubt there is still much room for optimization.


Have we really progressed memory speeds to the point that reading memory is faster than execution speed on the processor? That just feels wrong.

Your other points make sense. And those would surface as constraints in a similar fashion as a CPU throughput limitation would. I just expect those well before a CPU constraint.

I confess I have not researched CPU speeds in quite a while. Nor RAM speeds. Just surprised to see that the CPU's throughput would ever really be the bottleneck.


> Have we really progressed memory speeds to the point that reading memory is faster than execution speed on the processor? That just feels wrong.

Sort of. Random main memory access speeds aren't there, but L1 is; as long as your memory accesses are sequential, proper coding can stream sequential main memory into L1 and keep the CPU working.

But it's effectively a lost art - the vast majority of programs spend a significant part of their time stalling on memory access.


That doesn't follow. Yes, the cache is fast enough, but assuming you really are using every cycle, eventually the CPU will catch up, no?

My metaphor is a tub. If the drain is faster than the faucet, then even if you fill the tub before opening the drain, you'll still end up with an empty tub.

Only way you can prevent that is if the faucet is the same speed as the drain. Right?

Now, again, the cache can help you see more time where you are getting max utilization. Hard to believe it would be a full workload, though.


To stretch the tub metaphor to include latency, you have to imagine someone turning the faucet off and on (or plugging and unplugging the drain) depending on how full the tub is. And, with memory, there may be one spigot (the bandwidth), but a tremendous number of valve knobs (addresses), such that someone has to decide the correct knob to turn before the tub empties out completely. If the wrong valve is opened, no water comes out (the wrong data is loaded into cache, which is the same as no data being loaded), so merely guessing is no good.


Main memory has enough bandwidth but too much latency. If your code is written to account for that, then yes, your CPU will be your bottleneck. If it isn't, then your CPU will be waiting for memory.

DDR4 can do >20GB/s, but the latency is >10ns. If you prefetch enough cycles in advance, data will be in L1 by the time you need it. If you don't, it won't be.


And that is still surprising to me. Quickly checking the internet, I see [1], which lists the memory bandwidth of an i7 at about 25 GB/s. That said, I see the newer specs report GT/s, which is new to me. Looks like those are much lower. Though, I suspect I'm just not looking at the right specs.

Regardless, I've learned that RAM has gotten much faster than I would expect. Cool to see.

[1] https://ark.intel.com/products/83503/Intel-Core-i7-4980HQ-Pr...


The 25 GB/s corresponds to the fastest rate at which DDR4 can deliver data.

That said, any nontrivial computation is unlikely to keep up with that speed (e.g. with 64-bit data that is ~3 GD/s, so whatever computation you do has to fit in an amortized 1 cycle on a 3 GHz clock, or the CPU is the bottleneck).

GD=GigaDatum
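
A quick back-of-the-envelope in C, just to spell out that arithmetic (the 25 GB/s and 3 GHz figures are the round numbers used above, not measured values):

  #include <stdio.h>

  int main(void) {
      double bytes_per_sec   = 25e9;  /* ~25 GB/s peak DDR4 bandwidth (round number) */
      double bytes_per_datum = 8.0;   /* 64-bit values                               */
      double clock_hz        = 3e9;   /* ~3 GHz core clock (round number)            */

      double data_per_sec     = bytes_per_sec / bytes_per_datum;  /* ~3.1e9, i.e. ~3 GD/s */
      double cycles_per_datum = clock_hz / data_per_sec;          /* ~1 cycle per datum   */

      printf("%.1f GD/s, %.2f cycles per datum\n", data_per_sec / 1e9, cycles_per_datum);
      return 0;
  }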


I'm curious how much of a contributing factor this is to why many deep learning pipelines use smaller floats. Specifically, I "knew" that the throughput of some math operations was actually greater than 1 per cycle. That said, I have not kept up with which operations. I would not have expected the floating point units to be the fast ones.

Again, thanks for sticking with me in this thread. I learned a bit. Hoping to come up with ways to play with this knowledge. :)


Curious whether memory controllers can stripe bytes across slots much like RAID can do for disks.


I don't think it's bytewise (just like nobody does RAID3), but, yes, they can and do. The keyword you're looking for is "interleaving".

I think it's also not even by DIMM slot but by channel, so if the channels are unevenly populated (e.g. 2 slots filled on one channel but only 1 on another), interleaving is disabled.


> The constraining factor in this case is CPU throughput. The CPU cannot execute instructions fast enough to keep up with the data being read in from memory.

If you take any basic computer architecture class you'd know that this is not the case, unless there are breakthroughs within the past 5 years I have not read about. DRAM is at least 10x slower than the CPU, unless your prefetcher is so accurate it can feed the i-cache/d-cache without any stalls on the CPU side.



