
How misaligning data can increase performance 12x by reducing cache misses - luu
http://danluu.com/3c-conflict/
======
cowsandmilk
I see many comments here acting as if this is about word alignment (e.g. using
the packed pragma, noting that Sandy Bridge has fast unaligned access, etc.).
It has nothing to do with word alignment. As the article states at the
beginning, this is about aligning to page boundaries, which is a very
different level from word boundaries or structure packing. Let's not get these
confused.

------
barrkel
It's important to at least know about the n in n-way set associative cache
(i.e. that it exists) and this article is a good reminder. Next time you see
data accesses that ought to be fast (ought to be hot in cache) but seem not to
be, this is another thing you can look for.

It's easy, from a software engineer perspective, to know that your CPU has
cache, and just think it's like any other cache you might implement in
software. But the implementation details - the hashing by masking the address,
and only having a few slots available for things that hash to the same bucket
- are actually important, as shown here.
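
The "hashing by masking the address" can be made concrete with a little arithmetic. A minimal sketch (the 64-byte line and 64-set parameters are illustrative, roughly a small L1; the function name is made up):

```python
def set_index(addr, line_size=64, num_sets=64):
    """Which cache set an address maps to: drop the line-offset bits,
    then keep only the low set-index bits (i.e. mask the address)."""
    return (addr // line_size) % num_sets

# With 64 sets of 64-byte lines, the index repeats every 64 * 64 = 4096
# bytes, so any two page-aligned addresses land in the same set.
print(set_index(0))          # set 0
print(set_index(4096))       # also set 0: conflicts with the first
print(set_index(4096 + 64))  # set 1: one line of offset avoids the conflict
```

In an n-way cache, the (n+1)-th address that maps to the same set evicts one of the earlier n, even if the rest of the cache is empty.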

------
mtanski
One of the things to keep in mind here is that this trick won't work, or won't
work as well, on every architecture or every processor family within an
architecture.

Some architectures do not support unaligned memory access and will raise an
exception. If you use something like the packed attribute on your structs,
your compiler will generate correct code, but that code will be slower. In
almost all cases it will generate many more instructions, and because of the
larger code size your instruction cache will be less effective, your decoder
cache will be less effective, etc...

The author has a fairly modern Intel processor. The x86 family has always
supported unaligned access, though it was historically slower in terms of
cycles. Recent Intel processors have made this penalty much smaller. I believe
this was driven by network applications, many of which focus on packing as
many bytes down the channel as possible rather than on alignment requirements.
[http://www.agner.org/optimize/blog/read.php?i=142&v=t](http://www.agner.org/optimize/blog/read.php?i=142&v=t)

~~~
cube13
You're talking about word-aligned boundaries, which is absolutely an issue on
certain architectures (I remember dealing with it on older SPARC processors).
The article is talking about L2 and L3 processor cache hash collisions, which
can cost performance as cache lines are evicted.

This optimization doesn't necessarily rule out those architectures; it's
saying that instead of allocating at addresses 512, 1024, etc., there might be
a boost from allocating at off-page addresses.

~~~
mtanski
You are correct.

------
ori_b
The usual term for this is "cache colouring", and many allocators do this for
you.

[http://en.wikipedia.org/wiki/Cache_coloring](http://en.wikipedia.org/wiki/Cache_coloring)

------
joe_the_user
I understand this only well enough to know that it may or may not be a
surprising and paradoxical result.

As a non-systems C++ programmer, it seems to reinforce the usual lesson: don't
initially optimize your structures for anything but readability; once you have
your system running and can pinpoint the bottlenecks, then try things like
aligning structures to various boundaries. But naturally, always have a
realistic test suite to verify you are actually improving things.

Sorry if this is boring.

------
mikerg87
Interesting. One question: how would I take advantage of this? Are there
special compiler flags or a special memory allocator that I would need? Do
libraries like BLAS already account for this?

~~~
acqq
I'm not aware of any C compiler flags that would "fix potential cache-slot
collisions" for you automatically, nor of any allocators that do. Intel
processors have some profiling registers that can point you to this kind of
problem, but typically you first have to know what you want to profile.

Compilers are actually built to align the structures you write, since many
processors have significantly slower access to misaligned values. Allocators
likewise have to return aligned addresses from each "malloc." The newest Intel
iX processors are actually an exception in being able to amortize misaligned
accesses.

I've actually used hand-made unaligned "string" stores: if you allocate a
bigger memory block and store variable-size character sequences one after
another, their starts won't be aligned unless you arrange for that. For
doubles and other fixed-size values it's still better to keep them aligned.

That said, I wasn't able to extract much value from the article. Something
constructed can be made slow or fast, fine, but I don't see anything that
would inspire me to improve my real code. Maybe somebody who reads the article
can produce such examples?

~~~
alexkus
You can always write your own incremental allocator that hands out
word-aligned (or whatever alignment) blocks of memory (carved from a large
block obtained via a normal malloc() call) for the individual structures, such
that their page offsets are spread evenly (and other problems, such as
spanning pages, are avoided).

[EDIT] although it is tricky to do optimally given that different processors
will have different cache set characteristics (as the article shows).
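
A toy model of such an allocator (pure Python arithmetic over a pretend address range; the class name and parameters are made up for illustration): each allocation is padded by one extra cache line, so successive blocks start at different page offsets instead of all sharing one.

```python
PAGE = 4096  # illustrative page size
LINE = 64    # illustrative cache line size

class ColoringBumpAllocator:
    """Toy bump allocator over a pretend address range. After each
    allocation it skips one extra line, staggering page offsets."""
    def __init__(self, base=0):
        self.next = base

    def alloc(self, size):
        addr = self.next
        # Round the size up to a whole line, then skip one extra line
        # so the next block gets a different page offset ("color").
        self.next += (size + LINE - 1) // LINE * LINE + LINE
        return addr

a = ColoringBumpAllocator()
addrs = [a.alloc(PAGE) for _ in range(8)]   # eight page-sized blocks
offsets = [addr % PAGE for addr in addrs]   # their offsets within a page
print(offsets)  # all different: 0, 64, 128, ...
```

A naive bump allocator would hand back eight blocks with identical page offsets, all competing for the same cache sets; here they spread across eight.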

~~~
acqq
"Spreading" which will produce optimal cache use is something that depends on
dynamic and not static properties of the program, so you'd have to "spread"
differently depending on the use patterns. I can't imagine any universal
solution.

And most of the programmers make much bigger omissions than those mentioned in
this topic. Like using wrong algorithms, wrong libraries, doing too many
allocations, having bad structures of the data... So this topic effects are
invisible unless you already fixed other issues.

~~~
alexkus
True, but having every structure aligned to exactly the same offset within
each page is extremely unlikely. Given a 4KB page size, a whole bunch of
32-byte structures aligned that way would represent a huge waste of memory. No
sane allocator will allocate things like this.

If your structures happen to be very close to the system's page size, though,
it could easily happen; then you'd need to avoid this yourself with your own
incremental allocator (or other tricks).

Definitely agree with your last point, I've seen lots of code (in commercial
applications) where people are optimising completely the wrong thing.

------
nkurz
Hi Dan --

Great example. I looked briefly at the source, and wasn't sure whether
"pointer_chase" was on or off in your graphs. Or maybe it didn't make a
difference?

 _Page-aligned accesses like these also make compulsory misses worse, because
prefetchers won’t prefetch beyond a page boundary. But if you have enough data
that you’re aligning things to page boundaries, you probably can’t do much
about that anyway._

To the contrary, I think this is one of the relatively rare cases that
explicit prefetching can help you. But maybe this helps only once your sets
are too large for L3?

A few months ago I wrote a little about my attempts to speed up a Stream
benchmark on Sandy Bridge, which might have some overlap with your post:
[http://software.intel.com/en-
us/forums/topic/480004#comment-...](http://software.intel.com/en-
us/forums/topic/480004#comment-1763753)

------
danso
> _Well, I’ve now managed to blog about three of the areas where I have the
> biggest comparative advantage. Three or four more blog posts and I’ll be
> able to write myself straight out of my job. I must be moving up in the
> world, because I was able to automate myself out of my first job with a
> couple shell scripts and a bit of Ruby. At least this requires some human
> intervention._

This was a great explanation of caching architecture, but I really want to
hear the story of how the OP automated himself out of his first job with some
shell scripts and Ruby.

------
dded
Many modern processors hash some of the address bits before using them as a
cache index to avoid these problems.

To avoid them in code, it's often sufficient to round array sizes to
convenient decimal numbers instead of powers of 2. That is, allocate an array
of 1000 instead of an array of 1024.
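
The arithmetic behind this suggestion is easy to check. A sketch, assuming an illustrative cache with 512 sets of 64-byte lines, reading the first element of 64 consecutive rows of a 2-D array of doubles:

```python
LINE, SETS = 64, 512  # illustrative: 64-byte lines, 512 sets

def sets_touched(row_bytes, rows=64):
    """Set indices hit when reading the first element of each row."""
    return {(i * row_bytes // LINE) % SETS for i in range(rows)}

# 1024 doubles per row = 8192 bytes: the stride divides evenly into the
# cache's index range, so 64 rows pile up on only 4 distinct sets.
print(len(sets_touched(1024 * 8)))  # 4

# 1000 doubles per row = 8000 bytes: the 64 rows spread over 64 sets.
print(len(sets_touched(1000 * 8)))  # 64
```

With only 4 sets in play, an 8-way cache can hold just 32 of those 64 rows before conflict evictions start; the 1000-wide layout has no such problem.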

------
scott_s
The linked usenet discussion is worth reading, partially because a student
asks for help on their homework, and it eventually pulls in Linus Torvalds:
[https://groups.google.com/d/msg/comp.arch/_uecSnSEQc4/jkdcQc...](https://groups.google.com/d/msg/comp.arch/_uecSnSEQc4/jkdcQcRatXoJ)

The actual discussion is good, too.

------
colanderman
Why don't caches store data based on a linear hash of the address, rather than
simply the low-order bits? This would retain the property that an aligned
block of data can be stored without collision, and would extend this benefit
to page-strided data. Even a "hash" as simple as (low order bits) XOR (middle
bits) would provide this benefit.
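
A sketch of that XOR scheme, with made-up parameters (64-byte lines and 64 sets, so 6 offset bits and 6 index bits): page-strided addresses that all collapse into set 0 under the plain mask spread across all 64 sets under the XOR.

```python
OFFSET_BITS, INDEX_BITS = 6, 6   # 64-byte lines, 64 sets (illustrative)
MASK = (1 << INDEX_BITS) - 1

def plain_index(addr):
    # Conventional: the set index is just the low bits above the line offset.
    return (addr >> OFFSET_BITS) & MASK

def xor_index(addr):
    # The suggested "hash": XOR the middle bits into the index bits.
    return ((addr >> OFFSET_BITS) ^ (addr >> (OFFSET_BITS + INDEX_BITS))) & MASK

page_strided = [i * 4096 for i in range(64)]  # one address per page
print(len({plain_index(a) for a in page_strided}))  # 1: all collide in set 0
print(len({xor_index(a) for a in page_strided}))    # 64: fully spread
```

The property the comment mentions is preserved: for a contiguous, aligned block the middle bits are constant, so XORing them in merely permutes the set indices and the block still maps without collisions.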

------
bwzhou
Why would you want to make an in-memory data structure page-aligned instead of
cache-line-aligned?

~~~
rossjudson
It's not that you would want to. It's that certain data structures
(particularly in kernel land) naturally form around power-of-2 sizes. Since
the indexing in an n-way set-associative cache is done by masking away address
bits, you can get into a nasty situation where multiple elements of your data
structure end up mapping to the same set, which holds only n lines (the n in
n-way). You don't want that if you are going to walk the elements in your
structure.

Either avoid walking the elements, or do whatever it takes to ensure that you
don't end up getting stuck behind the n.

Hardware is inescapable.

It's also worth noting that you can use this to your advantage, ensuring that
accesses to certain elements do _not_ push much else out of the associative
cache.

------
brokenparser
Can someone explain those charts?

~~~
jbl
The OP is plotting the _ratio_ of page-aligned vs. unaligned access time
against problem size. So, on the y-axis, a value of 2 means the page-aligned
access took twice as long as the unaligned access.

Took me a few moments to grok the charts too, since I haven't had my coffee
yet.

~~~
jere
Ah, good explanation. The use of "vs" in a graph title makes me assume it's
describing X vs Y, which is confusing.

It also helped when I realized the graph starts at 1 for a working set size of
8, and the author correspondingly says:

>Except for very small working sets (1-8), the unaligned version is noticeably
faster

------
ww520
Darn, really unexpected. Learn something new every day. Page-aligned access is
common because I/O is usually aligned on pages. CPU-intensive cases turn out
to be different because of cache line usage.

