
How misaligning data can increase performance by reducing cache misses (2014) - signa11
https://danluu.com/3c-conflict/
======
jepler
I only wish that at $DAY_JOB I ever encountered a performance problem that was
as subtle as "oh you're not properly using the L2 cache". It's more like, "why
did you choose an N^3 algorithm when you can do it in N log N" or "why is your
design to write the same file on disk 1,000 times to perform one operation?"
sigh

~~~
flamedoge
To be generous, I wouldn't wish that on anyone. After a long, hard battle,
only those with perverse perseverance may find such a solution, and only a few
can find joy in doing it.

------
dgwynne
This is also discussed in section 4 of Jeff Bonwick's 1994 paper "The Slab
Allocator: An Object-Caching Kernel Memory Allocator".

~~~
twoodfin
bcantrill (who worked with Bonwick @ Sun) is always touting that paper as a
classic of systems research. I finally gave in and bought a copy of the
proceedings from Usenix where it was presented. Probably the best paper in a
first-rate class of them.

~~~
bcantrill
And if you _really_ want to see me tout it, check out my Systems We Love talk
on the slab allocator![1] ;)

[1] [https://systemswe.love/videos/down-memory-lane](https://systemswe.love/videos/down-memory-lane)

------
c517402
Old Cray computers had the best memory bandwidth when the fast dimension of an
array was of length 64*n +- 1.
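The reason, sketched with a toy model (assuming 64 interleaved memory banks, which those Crays had): walking the slow dimension with a stride that's a multiple of 64 hammers one bank, while a stride of 64*n +- 1 is coprime with 64 and cycles through all of them:

```python
def banks_hit(stride, num_banks=64, accesses=64):
    """Set of memory banks touched by a strided walk (toy model)."""
    return {(i * stride) % num_banks for i in range(accesses)}

print(len(banks_hit(64)))  # 1: every access hits the same bank
print(len(banks_hit(63)))  # 64: 63 is coprime with 64, so all banks are used
print(len(banks_hit(65)))  # 64: likewise for 65
```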

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=7004927](https://news.ycombinator.com/item?id=7004927).

~~~
ma2rten
It would actually be a nice feature to have in HN, if it gave you previous
submissions automatically.

~~~
kornish
There's a button which does that: "past", up right under the post link.

~~~
infogulch
Oh! I don't know why I never noticed that.

You know what would make it better: if it showed the number of past
submissions, just like the total comment count.

2 past

5 comments

------
CyberDildonics
This whole article is some sort of nonsense straw man. Of course if you spread
small amounts of data out so that they are all far apart, you will have bad
performance. When people talk about 'alignment' they are talking about 32, 64,
or 128 bits for atomics and 128, 256, and 512 bits for SIMD. Putting 8 bytes
at the beginning of a 4096-byte section and the next 8 bytes right next to it
isn't 'unaligned access'.

Even a single-byte access caches a full 64-byte line (two lines, with
adjacent-line prefetch), so of course packing data together is faster, but who
thought differently? It also saves TLB misses, and linear access patterns get
prefetched.
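To be concrete about what does and doesn't count as expensive here, a quick sketch (assuming 64-byte lines): an access only touches extra lines when it straddles a line boundary, not merely because it sits at some odd offset within a page:

```python
LINE = 64  # assumed cache line size in bytes

def lines_touched(addr, size):
    """How many cache lines an access of `size` bytes at `addr` spans."""
    return (addr + size - 1) // LINE - addr // LINE + 1

print(lines_touched(0, 8))   # 1: aligned, one line
print(lines_touched(8, 8))   # 1: not page-aligned, but still one line
print(lines_touched(60, 8))  # 2: straddles a line boundary
```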

~~~
db48x
This isn't about scanning a table where the elements are allocated
contiguously, it's about scanning a linked list where the elements could be
anywhere; the next element is almost never going to already be in the cache.
That's already bad enough (a cache miss on every step down the list), but
those items are all page-alined (they are all in the same position relative to
the 4k memory page that they are in), then they all have a lot of memory
address bits in common and end up evicting each other out of the cache,
slowing your scans down. If you only scan the list once then it's no big deal,
but if this is some frequently-accessed kernel data structure (list of
processes or threads or GDI objects or what have you), then it's a big
problem.

One fix is to ensure that these objects are stored at locations that don't
share so many common bits; another is for the CPU to hash the address when
choosing a cache set, rather than using the address bits as-is.
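A toy model of the conflict, assuming a cache with 512 sets of 64-byte lines (roughly a 256 KiB 8-way L2; the numbers are illustrative, not any specific CPU):

```python
LINE = 64   # cache line size in bytes (assumed)
SETS = 512  # number of cache sets, e.g. a 256 KiB 8-way L2 (assumed)

def cache_set(addr):
    """Set index: middle address bits taken as-is, no hashing."""
    return (addr // LINE) % SETS

# 100 list nodes, each at the same offset within its 4 KiB page:
aligned = {cache_set(k * 4096) for k in range(100)}
# the same nodes, each nudged by one extra cache line per page:
offset = {cache_set(k * 4096 + k * LINE) for k in range(100)}

print(len(aligned))  # 8: all 100 nodes fight over just 8 sets
print(len(offset))   # 100: every node lands in its own set
```

With only 8 sets in play, an 8-way cache can hold at most 64 of those page-aligned nodes before they start evicting each other, which is exactly the misalignment trick the article exploits.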

------
n4r9
This must be somewhat related to an idea I read about recently: indexing a
binary heap other than the natural way can avoid cache misses and increase
priority queue performance: [http://playfulprogramming.blogspot.co.uk/2015/08/cache-optimizing-priority-queue.html](http://playfulprogramming.blogspot.co.uk/2015/08/cache-optimizing-priority-queue.html)

