
The Ivy Bridge and Haswell BTB - ingve
http://xania.org/201602/haswell-and-ivy-btb
======
hatsunearu
Cool article, but if I may say, rainbow graphs are hard for me (and many other
people) to parse.

[http://betterfigures.org/2014/11/18/end-of-the-rainbow/](http://betterfigures.org/2014/11/18/end-of-the-rainbow/)

[https://eagereyes.org/basics/rainbow-color-map](https://eagereyes.org/basics/rainbow-color-map)

Not that it makes the article significantly weaker, but yeah.

~~~
mattgodbolt
Thanks for the feedback! I must say graphs and whatnot are not my forte - I
spent nearly two evenings trying to get them half-decent. Patches welcome, of
course :)

------
vardump
That was an interesting and insightful article about CPU branch target
buffers. Learned something new about Haswell and Intel BTBs in general.

Haswell's branch prediction seems to be pretty nice. I bet that makes quite a
bit of difference in branchy code. No need to care so much about arcane rules
about branching as in the past.

Haswell+ Xeons will likely be good with branchy code, especially since they
avoid those old pathological cases. Predicting 4096 branch targets with such
accuracy is very good.

~~~
nn3
Yes, the Haswell branch prediction is so good that it obsoleted the previous
"best strategies" for fast interpreter loops.

See
[https://hal.inria.fr/hal-01100647/document](https://hal.inria.fr/hal-01100647/document)
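
A rough sketch (mine, not from the paper) of the two classic dispatch styles at issue: a plain switch-based interpreter loop versus computed-goto ("threaded") dispatch. The opcodes and handlers are made up for illustration, and the threaded version relies on the GCC/Clang labels-as-values extension:

    /* Toy interpreter: switch dispatch vs. computed-goto dispatch. */
    #include <stddef.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    /* Switch dispatch: a single shared indirect branch at the top of the loop. */
    int run_switch(const unsigned char *code) {
        int acc = 0;
        for (size_t pc = 0;; pc++) {
            switch (code[pc]) {
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Threaded dispatch: one indirect branch per handler, which older
       predictors handled much better than the single shared branch above.
       On Haswell that gap largely disappears. */
    int run_threaded(const unsigned char *code) {
        static void *labels[] = { &&do_inc, &&do_dec, &&do_halt };
        int acc = 0;
        size_t pc = 0;
        goto *labels[code[pc]];
    do_inc:  acc++; pc++; goto *labels[code[pc]];
    do_dec:  acc--; pc++; goto *labels[code[pc]];
    do_halt: return acc;
    }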

But unfortunately a lot of real code is so big these days that it thrashes all
caches, including branch prediction.

Big is slow.

~~~
vardump
There are just 512 cache lines in each of the L1 code and L1 data caches.

512 * 64 bytes = 32 kB (and 512 is 8 * 8 * 8). I'd guess that if they added
more L1 cache lines (say up to 4096), access would take one clock longer,
which would very likely be a net performance loss.

Maybe it's time to increase the cache line size to 128 bytes. That could of
course break a lot of old code performance-wise: it'd cause more false
sharing [1] and double the memory bandwidth requirements for random access.
Maybe we should start to organize shared data to have 128-byte alignment.
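
A hedged sketch of what that kind of organization could look like in C11 (the names are made up, and 128 bytes is the hypothetical future line size, not today's):

    /* Keep each thread's hot data on its own (hypothetical 128-byte) line
       so concurrent updates never false-share. */
    #include <stdalign.h>
    #include <stdatomic.h>

    #define LINE_SIZE 128   /* assumed future cache line size */

    struct padded_counter {
        alignas(LINE_SIZE) atomic_long value;   /* sizeof rounds up to 128 */
    };

    struct padded_counter per_thread_counters[8];   /* one line per element */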

It'd also break some read-bandwidth-saving code that assumes 64-byte streaming
(non-temporal) stores eliminate the need for RFO [2] accesses.
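
The pattern I mean looks roughly like this (a sketch, assuming SSE2 and a 64-byte-aligned destination whose size is a multiple of 64): by filling a whole 64-byte line with non-temporal stores, the CPU can skip the RFO read entirely. With 128-byte lines, the same code would only cover half a line.

    #include <emmintrin.h>
    #include <stddef.h>

    /* Zero a buffer with streaming stores, one full 64-byte line at a time. */
    void stream_zero(void *dst, size_t bytes) {
        __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < bytes; i += 64) {
            char *p = (char *)dst + i;
            _mm_stream_si128((__m128i *)(p +  0), zero);
            _mm_stream_si128((__m128i *)(p + 16), zero);
            _mm_stream_si128((__m128i *)(p + 32), zero);
            _mm_stream_si128((__m128i *)(p + 48), zero);
        }
        _mm_sfence();   /* make the streamed stores globally visible */
    }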

Writing a _single byte_ to RAM would require the CPU core to first read a
128-byte cache line (RFO [2]) and then eventually write that cache line back.
Currently "only" 64 bytes need to be read and written in the same scenario.

With the same 512 lines, 128-byte cache lines would also mean the L1 code and
L1 data caches are both 64 kB, up from the current 32 kB. That'd definitely
help with monstrous codebases.

CPU design is full of compromises...

[1]: Atomic ops unintentionally touching the same cache line cause performance
loss through false sharing. The effect can be very significant, up to 2 orders
of magnitude.
[https://en.wikipedia.org/wiki/False_sharing](https://en.wikipedia.org/wiki/False_sharing)

[2]: Read for Ownership.
[https://en.wikipedia.org/wiki/MESI_protocol#Read_For_Ownership](https://en.wikipedia.org/wiki/MESI_protocol#Read_For_Ownership)

~~~
sliverstorm
For dram reads, to my knowledge you can't read only 64kb anyway. You get much
larger pages from dram.

~~~
vardump
> For dram reads, to my knowledge you can't read only 64kb anyway. You get
> much larger pages from dram.

I didn't say 64 kB; I said 64 bytes is the read and store transaction size.

I think DRAM page sizes are 2^n * 512 bytes, where n is a small non-negative
integer. Typical DRAM page sizes are 512 bytes, 1 kB or 2 kB.

So, with interleaved memory channels, the DRAM page changes every (DRAM page
size) * (number of memory channels) bytes. For example, with 1 kB pages and 2
channels, that's every 2 kB.

I think typical page switch intervals for a laptop with 2 memory channels are
every 1 kB (DDR4 minimum), 2 kB (DDR3 minimum, probably typical) or 4 kB.

Although DRAM page sizes are irrelevant in this context.

I was talking about the smallest possible _cached_ DRAM transaction -- read or
store. Which is the same as the cache line size, 64 bytes.

Simplified:

Read 1 byte from memory and the CPU will fetch 64 bytes.

Write 1 byte to memory and the CPU will first fetch 64 bytes (RFO), modify the
cache line and eventually, when the cache line is evicted (which can be quite
a while later), write 64 bytes back.
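
A toy illustration of that point (buffer size and loop structure made up; time each loop with e.g. perf stat to see the effect): over a buffer much larger than the caches, touching one byte per line still drags the whole 64-byte line through the cache, so it generates nearly the same memory traffic as writing every byte.

    #include <stdlib.h>

    #define LINE 64
    #define BUF  (256u * 1024 * 1024)   /* far larger than any cache level */

    int main(void) {
        unsigned char *buf = malloc(BUF);
        if (!buf) return 1;

        /* One store per cache line: 1 byte written, but ~128 bytes of
           traffic per line (64-byte RFO read + 64-byte writeback). */
        for (size_t i = 0; i < BUF; i += LINE)
            buf[i]++;

        /* Every byte written: same number of lines, similar total traffic. */
        for (size_t i = 0; i < BUF; i++)
            buf[i]++;

        free(buf);
        return 0;
    }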

