I’ll wager that the author didn’t look at the generated assembly: if that loop compiled down to a single CPU instruction, that instruction would be an increment, making the analysis of cache effects moot.
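For what it's worth, here's the kind of loop body I have in mind (my own sketch, not the article's actual code); with nothing but a counter inside, gcc or clang at -O2 will typically fold the whole loop into a single addition, leaving no per-iteration memory access to analyze:

    /* Sketch of a counting loop with no memory traffic in the body.
     * An optimizing compiler usually reduces it to "count = n". */
    long count_up(long n)
    {
        long count = 0;
        for (long i = 0; i < n; i++)
            count++;    /* register increment only, no loads or stores */
        return count;
    }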
However, in the last test I ran, where I was walking indefinitely over the linked list, I guess I was closer to the behavior you're pointing out.
For instance, if each item has a size of 64B (for N=7), which is the size of a cache line, then wherever the first element falls within the first cache line, the next item will necessarily extend into the next cache line. Did I miss something?
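For reference, here's what I assume the N=7 element looks like (the exact definition is in the article; the field names here are my guess, only the sizes matter):

    #include <stdio.h>

    struct en {
        struct en *n;   /* next pointer, 8 bytes on x86-64 */
        long pad[7];    /* N = 7 payload words, 7 * 8 = 56 bytes */
    };

    int main(void)
    {
        /* 8 + 56 = 64 bytes: exactly one cache line per element */
        printf("sizeof(struct en) = %zu\n", sizeof(struct en));
        return 0;
    }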
Anyway, the interesting case is when items are smaller than a cache line. Look at an array-based implementation of a heap:
Root is item 0. Assume the array is aligned to a multiple of 64 bytes and you have two 32-byte items per line. When comparing array items 1 & 2, 3 & 4, 5 & 6, etc., notice how the two items are always in different cache lines: item 1 shares a line with item 0, not with item 2, so you're always crossing a line boundary. If the root were at item 1 in the same array (or, equivalently, at item 0 if the array started at a multiple of 64 plus an offset of 32), then each pair would always be in the same cache line.
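To make the layout concrete, here's a small sketch (my own, assuming 32-byte items and 64-byte lines) that prints which cache line each sibling pair of the heap falls into, for an array that is 64-byte aligned versus one offset by 32 bytes:

    #include <stdio.h>

    #define ITEM_SIZE 32
    #define LINE_SIZE 64

    /* For each parent, report which cache lines its two children land in. */
    static void show(const char *label, size_t base_offset)
    {
        printf("%s\n", label);
        for (size_t parent = 0; parent < 7; parent++) {
            size_t left  = 2 * parent + 1;
            size_t right = 2 * parent + 2;
            size_t line_l = (base_offset + left  * ITEM_SIZE) / LINE_SIZE;
            size_t line_r = (base_offset + right * ITEM_SIZE) / LINE_SIZE;
            printf("  children %zu & %zu -> lines %zu and %zu%s\n",
                   left, right, line_l, line_r,
                   line_l == line_r ? " (same line)" : " (split)");
        }
    }

    int main(void)
    {
        show("root at offset 0 (array 64B-aligned):", 0);
        show("root at offset 32 (64B-aligned + 32):", 32);
        return 0;
    }

With offset 0 every sibling pair is split across two lines; with the extra 32-byte offset every pair lands in a single line, which is the point I was trying to make.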
In case it wasn't clear, I was pointing out a similar phenomenon with other data structures; it wasn't about linked lists.
I've also read that processors are able to request the words of a cache line in a different order, asking for the hot word first (usually called "critical word first"). That may help a little if this kind of structure is not aligned in an optimized fashion. It doesn't work for hardware prefetch, though.
Could you explain the first part of the sentence (about the list elements spreading over several cache lines)? It's true that the elements do not fit into a cache line for N>7, but is that important for the prefetcher? Shouldn't the regular access pattern to "en.n" be detected by the streamer prefetcher (without shuffling, of course)?
You're right, and this is exactly why, a few paragraphs later, I cast doubt on that statement, saying that the prefetcher is certainly not reading the entire array and may behave exactly as in the N=7 case. I think this is more about the fact that the prefetcher can't cross page boundaries (see the quick arithmetic sketch below).
That's still right for N<7, though.
I still have to dig on this part :)
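In the meantime, a bit of back-of-the-envelope arithmetic (my own numbers, assuming 4 KiB pages and the 64-byte N=7 element) shows why a page-bounded prefetcher would matter here: it can only follow runs of 64 consecutive elements before hitting a boundary:

    #include <stdio.h>

    int main(void)
    {
        const size_t page_size = 4096;  /* typical x86-64 page */
        const size_t elem_size = 64;    /* one element per cache line (N = 7) */

        /* a prefetcher that stops at page boundaries covers at most this many
         * consecutive elements before it has to be retrained */
        printf("elements per 4 KiB page: %zu\n", page_size / elem_size);  /* 64 */
        return 0;
    }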
What I found is that my i5 is listed under "Products formerly Haswell", so its code name is Haswell. I don't know for certain whether it has the "next-page prefetcher", but I would say yes.
Looking at the document that a reader on Medium linked to me, it looks like the NPP was introduced with the Ivy Bridge family (section 2.4.7), and section 2.3 says: "The Haswell microarchitecture builds on the successes of the Sandy Bridge and Ivy Bridge microarchitectures."
That said, how the NPP works is not really clear. I'd say it's maybe more about prefetching the next page to avoid TLB misses. In this document I didn't find any statement about data prefetches crossing a page boundary.