I've actually run into this in the "real world" in image processing applications. Given a "magical" image width, a pixel directly above (or below) another pixel can end up at the same alignment with respect to the CPU cache. This can mean that things like (large) 2D convolutions are slower on images of very particular sizes.
Unless you are a nutter like me, in which case you hand-roll everything in assembly.
I remember the first time I saw an image data struct that specified bits per pixel, image width and then also had bytes per row. I was a bit perplexed about why you'd store bytes per row if you could just compute it from bpp and width. Turns out you can fix/avoid a lot of performance issues if you can pad the data, so it's good practice to always save the number of bytes per row and use that in your mallocs/copies/iterators/etc.
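In C it usually ends up looking something like this (a hypothetical sketch with made-up names, not any particular library's struct): compute the stride once, pad it to a cache-friendly boundary, and then use it everywhere instead of width * bpp:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical image struct: bytes_per_row is stored explicitly instead
     * of being recomputed from width * bytes_per_pixel, so rows can carry
     * padding. */
    typedef struct {
        uint8_t *data;
        int      width;           /* pixels per row */
        int      height;
        int      bytes_per_pixel;
        size_t   bytes_per_row;   /* >= width * bytes_per_pixel */
    } Image;

    /* Round the natural row size up to a whole cache line (64 bytes), and
     * nudge it off an exact 4K multiple so rows don't all land on the same
     * cache sets. */
    size_t padded_stride(int width, int bpp)
    {
        size_t stride = ((size_t)width * bpp + 63) & ~(size_t)63;
        if (stride % 4096 == 0)
            stride += 64;
        return stride;
    }

    int image_alloc(Image *img, int width, int height, int bpp)
    {
        img->width = width;
        img->height = height;
        img->bytes_per_pixel = bpp;
        img->bytes_per_row = padded_stride(width, bpp);
        img->data = malloc(img->bytes_per_row * (size_t)height);
        return img->data ? 0 : -1;
    }

    /* Every row access goes through bytes_per_row, never width * bpp. */
    uint8_t *image_row(const Image *img, int y)
    {
        return img->data + (size_t)y * img->bytes_per_row;
    }

The key habit is the last function: all iteration and copying steps by bytes_per_row, so the padding stays invisible to the rest of the code.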
At Network Appliance there is a guy who understands cache effects inside and out. File systems, it turns out, are particularly sensitive to them, because 'pointer chasing' (the essence of file system traversal) is often a very cache-expensive operation.
If your algorithm has ten intertwined loops, do you have any hope of systematically optimizing performance, or are you reduced to "intuition" and easter-egging? (http://www.catb.org/jargon/html/E/Easter-egging.html)
What about "cache oblivious" algorithms? http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
"The other cause of erratic performance—and by far the worst—was 64K aliasing. On most Pentium 4s, if two data items in the L1 cache are at the same address modulo 64K, then a massive slowdown often occurs. Intel's documentation explains this by saying that the instruction that accesses the second piece of 64K-aliasing data has to wait until the first piece is written from the cache, which is to say until the first instruction is completely finished. As you might expect, this can wreak havoc with out-of-order processing. It was 64K aliasing that caused the aforementioned four-times slowdown in our large-triangle drawing benchmark.
"The truly insidious thing about 64K aliasing is that it is very hard to reliably eliminate it. Sure, it helps to avoid obvious problems. For example, pixel buffers and 32-bit z buffers are accessed at the same intrabuffer offsets for any given pixel, so if they encounter 64K aliasing at all, it will happen constantly; to avoid this, Pixomatic makes sure the two buffers never start at the same address modulo 64K (and also not at the same address modulo 4K, to avoid cache contention). But time after time, we've tweaked our data to get rid of 64K aliasing, only to have it pop up again a few days later after more changes were made. About all you can do is run VTune periodically to see if the number of 64K-aliasing events has spiked and, if it has, start backing out changes to figure out what variables or buffers need to be tweaked so they're not 64K aligned.
"On newer Pentium 4s, 64K aliasing changes to 1M aliasing, and in the most recent versions to 4M aliasing, which should reduce the problem considerably. Older Pentium 4s that have 64K aliasing also have 1M aliasing, which introduces an additional penalty on top of the cost of 64K aliasing, so beware of 1M aliasing even more than 64K.
"By the way, on Pentium 4s that have 1M aliasing but not 64K aliasing, the aliasing event counter still fires on 64K-aliasing events, even though they don't affect performance on those chips, so VTune will report incorrect results in such cases. On newer P4s that only have 4M aliasing, VTune will correctly report only 4M-aliasing events."
Edited to add: So you'd be ok with your next interviewer asking you to use the method of Frobenius to solve an ordinary differential equation, because, "hey, I had to learn that as part of my CS math prerequisite courses, so you should know it too, even though you're just interviewing for a web dev job?"
Differential equations were not part of my CS curriculum, but I imagine they would be fair game for any engineer. I would not expect them to work the answer out for me on a whiteboard right there, but I don't know what the equivalent of "simply identify what the question is" would be in this case.
It's always better to align the interview with the job (and the tasks the candidate may actually end up involved in), not some idealized CS background that you think everyone should have.
I think that you are not giving your average "web developer" enough credit.
This isn't about "formal CS curriculum", this is about a passing familiarity of modern computer architecture.
CPU caches are indexed by the low bits of the address. If two addresses share the same low-order bits, they map to the same cache set and can evict each other, even though the cache isn't full.
Large allocations (e.g. huge arrays) tend to get their own memory pages, which means the same offset into two different arrays ends up with the same low-order address bits.
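If you want to see it for yourself, a toy benchmark along these lines works (assumptions: a glibc-style malloc that mmaps large allocations page-aligned, and a cache whose associativity is smaller than the number of arrays walked in lockstep). Rebuilding with PAD set to 64 staggers the start addresses and may make the same loop noticeably faster:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N        (1 << 22)   /* 4M floats per array -> allocated via mmap */
    #define NARRAYS  16
    #define PAD      0           /* rebuild with PAD 64 to stagger the arrays */

    int main(void)
    {
        float *base[NARRAYS], *a[NARRAYS];
        for (int k = 0; k < NARRAYS; k++) {
            base[k] = malloc(N * sizeof(float) + NARRAYS * PAD);
            /* With PAD 0 every a[k] starts page-aligned, so a[k][i] shares
             * its low address bits across arrays and maps to the same set. */
            a[k] = (float *)((char *)base[k] + (size_t)k * PAD);
            for (int i = 0; i < N; i++) a[k][i] = 1.0f;
        }

        struct timespec t0, t1;
        double sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int rep = 0; rep < 10; rep++)
            for (int i = 0; i < N; i += 16)      /* one touch per cache line */
                for (int k = 0; k < NARRAYS; k++)
                    sum += a[k][i];              /* same offset in every array */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("sum=%.0f  elapsed=%.3f s\n", sum,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

        for (int k = 0; k < NARRAYS; k++)
            free(base[k]);
        return 0;
    }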