

How Misaligning Data Can Increase Performance 12x (2013) - mattgodbolt
http://danluu.com/3c-conflict/

======
TillE
I've read this article before, and I have no idea why anyone would think that
_page_ alignment is a good idea, unless you're doing kernel-level stuff that
absolutely requires it.

Optimization for performance has always been about cache lines (64 bytes), not
pages (4 KiB). Of course you're going to get terrible results when you're
wasting huge amounts of memory.
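
To make the conflict concrete (a sketch, assuming a typical 32 KiB, 8-way
L1d, i.e. 64-byte lines and 64 sets): every page-aligned address has the
same low 12 bits, so every page-aligned buffer maps to set 0, and more
than 8 such streams thrash one set while the other 63 sit idle. Offsetting
each buffer by one cache line spreads them out:

    #include <stdio.h>

    int main(void) {
        const unsigned long line = 64, sets = 64;  /* 32 KiB / 8 ways / 64 B */
        /* Page-aligned buffers: identical low bits, identical set. */
        for (unsigned long i = 0; i < 4; i++) {
            unsigned long addr = i * 4096;
            printf("aligned %#7lx -> set %2lu\n", addr, (addr / line) % sets);
        }
        /* Misaligned by one line per buffer: sets 0, 1, 2, 3. */
        for (unsigned long i = 0; i < 4; i++) {
            unsigned long addr = i * 4096 + i * line;
            printf("offset  %#7lx -> set %2lu\n", addr, (addr / line) % sets);
        }
        return 0;
    }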

~~~
somerandomone
It's probably because if your data spans two pages, accessing it back and
forth may cost a lot of extra paging work (two TLB entries, potentially two
faults)?

~~~
TrainedMonkey
I think your premise is correct: if our data is all on one page, we only
need to fetch one page, and if we write anything back, only one page needs
to be updated. With two pages, the read/write work doubles.

~~~
MichaelGG
And the prefetcher cannot fetch across page boundaries since that might fault.

~~~
TheLoneWolfling
This seems like a hardware flaw to me.

We can do OoO execution, so why can't we do OoO prefetching w.r.t. page
faults? (I.e. try to fetch into a cache; if it would page fault, don't. If
anything that would make the fetch page fault happens between it being
prefetched and it logically being fetched, invalidate the cache line.)
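
Note that software prefetch already has these semantics on x86: the
prefetch hint instructions are architecturally non-faulting, so a hint
aimed at an unmapped page is simply dropped. A sketch using the GCC/Clang
builtin (the one-page lookahead distance here is arbitrary):

    #include <stddef.h>

    /* Sum an array while hinting one 4 KiB page ahead (512 longs).
       The hint never faults: if the target page isn't mapped, the
       hardware silently ignores it instead of trapping. */
    long sum(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 512 < n)
                __builtin_prefetch(&a[i + 512], 0, 3);  /* read, keep in cache */
            s += a[i];
        }
        return s;
    }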

~~~
thesz
This was actually done in some early DEC Alphas, I believe.

The OoO engine could issue a load's address calculation ahead of time (if
the address register was ready), so by the time the actual load instruction
executed, the data was already in cache, or at least closer to it.

It's really easy to do, actually. It can even simplify page fault handling
in the CPU.

------
colanderman
I've always wondered: why do CPU caches only key off the low-order bits of the
memory address? e.g., DDR controllers generally key off a linear hash of most
of the address.

Is gate count really that tight in L1, that they can't throw a few XOR taps in
front of the cache bus? Or is it simply to make cache collisions more
predictable?
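
For concreteness, the two schemes I mean, with the hash made up for
illustration rather than taken from any real part:

    /* Conventional L1 indexing: low-order bits only. With 64-byte
       lines and 64 sets, bits [11:6] select the set, so addresses
       4 KiB apart always collide. */
    unsigned set_plain(unsigned long addr) {
        return (addr >> 6) & 63;
    }

    /* Hashed indexing, DDR-controller style: XOR a few higher-order
       bit groups into the index so page-aligned streams spread out. */
    unsigned set_hashed(unsigned long addr) {
        return ((addr >> 6) ^ (addr >> 12) ^ (addr >> 18)) & 63;
    }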

~~~
etep
Gate count is not the issue; the issue is L1 timing. For clocks at 2 GHz
plus, there can be very few stages of logic between flops. Just decoding an
address to a one-hot wire (necessary to access the memory bank) takes up a
good chunk of that time interval. Reading the bits out and resolving them
back to full rail also takes time. Checking for a tag match, more time.
Routing that back to the register file, yet more time. If it's not the L1
cache but, say, shared L3, the total time of flight across the chip is
multiple clock cycles, on top of all the aforementioned time penalties and
the longer access latency into the even larger shared L3 cache.

Relatedly, in the L3 there is a hash function used to distribute different
addresses to different regions of the L3. The cost of doing this is less
significant for two reasons: the L3 access latency is already much, much
higher (as elaborated above), and the hash calculation can be done in
parallel with other required logic (e.g. in parallel with the L2 access).
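
The shape of such a distribution hash, very roughly (the real
slice-selection functions, e.g. Intel's, are undocumented parity functions
over many physical address bits; this XOR-fold is only meant to show the
idea):

    /* Fold a physical address down to one of 8 L3 slices. */
    unsigned l3_slice(unsigned long addr) {
        addr >>= 6;                /* drop the line-offset bits */
        unsigned slice = 0;
        while (addr) {
            slice ^= addr & 7;     /* fold three bits at a time */
            addr >>= 3;
        }
        return slice;              /* 0..7 */
    }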

~~~
colanderman
Thanks, that was a great answer.

