
Memory Access Patterns Are Important (2012) - wanderer42
http://mechanical-sympathy.blogspot.com/2012/08/memory-access-patterns-are-important.html
======
chris_va
One critical thing I wish I could go back and remind my younger self:

The cache is shared across processes.

The net result is that cache latency/hit rate is highly dependent on what else
is using the machine. CPU-bound tasks (search engine backends, say) tend to
leave spare storage capacity, which schedulers backfill with things like image
serving. Streaming
images out of a machine can completely thrash the cache without using many
cores. When your net latency is based on the slowest percentile to come back,
one badly co-scheduled process can ruin your entire cluster capacity.

~~~
packetslave
Intel finally added cache partitioning in Broadwell, which lets you restrict
parts of the cache to certain cores. Combining that with pinning tasks to
cores can help with cache thrashing.
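On Linux this is exposed through the resctrl filesystem. A hedged sketch,
assuming a CAT-capable CPU and root; the group name, way bitmask, PID, and core
numbers are all made up for illustration:

```shell
# Mount the resource-control filesystem and create a partition group
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/latency_sensitive

# Reserve a slice of L3 on cache domain 0 for this group
# (the value is a bitmask of cache ways; check your CPU's capabilities)
echo "L3:0=f" > /sys/fs/resctrl/latency_sensitive/schemata

# Move a task into the group, then pin it to cores 2-3
echo 1234 > /sys/fs/resctrl/latency_sensitive/tasks
taskset -cp 2,3 1234
```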

~~~
bogomipz
Is that enabled by default? Turned on in the BIOS or early boot?

------
joshbaptiste
Martin's videos on lock-free algorithms are very informative and give great
insight into how to push the JVM to the extreme.
[http://www.infoq.com/author/Martin-Thompson#Presentations](http://www.infoq.com/author/Martin-Thompson#Presentations)

------
malingo
This kind of article stands in interesting contrast to the recent post on
"non-programmers' solutions to programming problems."

On that post, jbclements commented [0] that we are reaching / have reached a
point where "the conversation shifts from 'how can I meet the machine's
needs?' to 'how can the machine meet my (programming) needs?'"

The idea of "the machine meeting my programming needs" still requires a
programming platform that has been designed with a great deal of mechanical
sympathy.

[0]
[https://news.ycombinator.com/item?id=11509543](https://news.ycombinator.com/item?id=11509543)

------
nulltype
If main memory is so slow, would it be better to treat cache as memory and
handle it directly in code rather than try to trick it into doing what we
want?

~~~
corysama
[https://en.wikipedia.org/wiki/Scratchpad_memory](https://en.wikipedia.org/wiki/Scratchpad_memory)

On the original PlayStation, the scratchpad was a PITA. You had to weigh the
speedup against the cost of manually copying data in and out of the pad. A modern
implementation would copy via DMA like the PS3's Cell processors. Still a
PITA, but at least the payoff for the effort can be quite good.

~~~
clevernickname
Came here to post this.

The other advantage of cache over explicit control of the on-chip SRAM is that
it allows code to be inherently forwards and backwards compatible, rather than
tying the code to a specific machine with a specific amount of on-chip memory.
And I imagine the nightmare would only increase if you had to factor
multitasking into an explicit scheme.

The primary benefit of explicit on-chip memory is not, AFAIK, that you can
manually manage it significantly better than a cache, but that it takes up
significantly less die space and has lower access latency. You really see this
idea shine in tiny cheap microcontrollers that have no external memory.

------
amelius
See also Ulrich Drepper's "What every programmer should know about memory"
[1], mentioned in one of the comments.

[1]
[https://people.freebsd.org/~lstewart/articles/cpumemory.pdf](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)

------
alexhutcheson
I did a similar project a couple years ago to benchmark the effects of
different memory access patterns. The whitepaper [0] includes some graphs for
how performance changes as you add more threads on different cores.

[0] [http://stoneridgetechnology.com/wp-content/uploads/2014/12/ComputevsMemory.pdf](http://stoneridgetechnology.com/wp-content/uploads/2014/12/ComputevsMemory.pdf)

------
drudru11
In the old days, mechanical sympathy was called RTFM

~~~
oldmanjay
There are a lot more manuals these days, and they are way longer. Also it
turns out that a lot of code from those old days was crap, so maybe RTFM was
inadequate.

~~~
hinkley
Most of those manuals were shitty to begin with.

You end up having to read the entire thing to properly understand a part of
the functionality that got its own chapter. And don't get me started on the
toy examples. The happy path is the least useful example you can use. I would
stumble on that by myself.

~~~
drudru11
Seriously, I should have been more specific. If you take a comp. arch. class,
then you get the generalization. After that, you can just read the relevant
part of a manual.

