
Gallery of Processor Cache Effects - arunc
http://igoro.com/archive/gallery-of-processor-cache-effects/
======
raverbashing
Very interesting (but not exactly news)

But I suppose in this "modern world" most people forget how their processors
work

~~~
jacquesm
Not sure why you're downvoted, your observation matches what I experience day-
to-day. Memory management is magical, network bandwidth is infinite and
networks have zero latency, CPU cycles are infinite, caches are infinite, and
hard drive transfer rates and seek times are so fast it doesn't matter. Until
it does.

Case in point: I just got called in on a project that was on the skids. 8 HP
blades, 128G each, an HP 3Par storage unit underneath it, network capacity by
the gallon, and it still wouldn't perform with a measly 9000 users per day for
a pretty simple website. Load spikes and restarts to get the system available
again were the order of the day. An existential threat to the company that had
ordered the system to replace their previous one.

After reducing the 130 or so VMs they were using to _one_ and getting rid of
all the superfluous hardware and interconnect little by little the system
started to work better. Another 3 weeks of fixes later we're running
comfortably on that single machine, loads are < 1 during any 24 hour period.
The waste and impedance mismatch between the applications and the hardware was
simply painful.

And all this was built by so-called 'experts', one guy calling himself a DBA
(wonder if he realized that you can have indices on more than just the primary
key field), a bunch of other gurus and a remote team of developers. Each and
every one of them may have their qualities, but as a collective with crappy
management they managed to make a system that could support a few million
daily users crawl trying to support a few thousand.

Understanding how your processor works is only one small element in that
story. I think that if you add to it forgetting how your hard drive works,
how the computer as a whole works, and being unfamiliar with the cost of
communications as well as with operating-system-level tuning, you might be
closer to the truth.

What bugs me is not that this stuff happens. What bugs me is how often it
happens.

~~~
CrLf
What you're describing is business as usual in any enterprise shop...

1. hardware requirements call for an absurd capacity for applications with
just a few hundred users

2. developers then proceed to utterly waste the available resources because
they seem to be infinite and "nobody cares"

3. the application ends up being slow as molasses and the hardware
requirements for the next similar project become even bigger

(There's also a number 4 in here, which is the belief that every high-traffic
website has a Google-sized datacenter behind it, which is simply not true. You
can serve millions of users with just a few machines.)

This is a never-ending cycle of wastefulness, just because nobody ever stops
to think about how the machines work and how to apply proper algorithms. The
classic argument of developer time vs. hardware costs is a funny one: often
bad solutions take just as long as good solutions to implement. But the latter
require knowledge that many developers seem to lack, and it's easy to assume
that the smart solution must cost more time.

Take ORMs like Hibernate. Developers use them because it saves them time, but
then weeks are wasted trying to optimize the applications in the wrong places
just to avoid having to think about what's under the ORM. I used to be a DBA
and performance improvements of over 1000x were pretty common just by creating
a few indexes in 5 minutes of my time (this was actually fun... "your
operation that took 30 seconds now takes 100ms, you're welcome"). All while
developers looked around their own code trying to fix the wrong thing.

Memory initialization is another example: I've been in the situation where an
application was generating hundreds of MB of new objects per second, which I
pointed out might be the cause of it being so slow (it was certainly the cause
of it eventually crashing, as the GC couldn't keep up). They still wasted
days looking at other things until finally understanding that this behavior
meant the CPU was working as if it had no cache at all.

Oh, and the constant complaints that the machines were slow when they were
systematically under 10% utilization? Locking and IO latency are also alien
concepts for some...

~~~
Sami_Lehtinen
It's wonderful to see posts like this. I've often been very frustrated by
exactly these issues. But this seems to be much more common than I thought.

One guy periodically ran select max(id) from table; on a table of over 5 GB
with no index on the id column. It took quite a while every time, maxing out
the disk I/O.

Usually developers claim that the server doesn't have enough memory, or that
it should have more cores, or that we should buy enterprise SSDs. Which is
simply insane when you could fix the issue very easily.

------
WayneS
I like to combine the graphs into one picture, and then the processor
structure appears in the picture:
[https://dl.dropboxusercontent.com/u/4893/mem_lat3.jpg](https://dl.dropboxusercontent.com/u/4893/mem_lat3.jpg)

This graph shows the memory latency for a linked-list walk that strides across
a buffer of a certain size in memory. I used to use this picture in interviews
and ask people to explain as much as they can about the processor from it. It
even works for people who know nothing about processor architecture, as I can
walk them through what it says and see how they think and react to new
information.

~~~
phaker
What processor is it? Or at least: how old is it?

~~~
WayneS
Old now: 1995. That is an early pre-production version of the Pentium Pro
running at 100 MHz. The 256 KB L2 cache was really fast relative to memory
because it was on a second die in the same package.

------
joseraul
Most examples are classic cache effects, but the last one is such a puzzle.

    
    
      A++; C++; E++; G++;    448 ms
      A++; C++;              518 ms
    

How can incrementing 2 variables be slower than incrementing 4 variables?

~~~
to3m
Perhaps the result is an average rather than the best result?

I wondered about this too so I tried it on my PC, though I had to make up my
own timing code since the author doesn't say what he was doing. Results for me
were more in line with what I'd expect, with both 4-variable cases being the
same speed and the 2-variable one being a bit quicker. (All the data for the
loop will fit into 2 or 3 cache lines so I don't think there's much chance
you'll see any memory effects being measured.)

This is compiling for x64.

------
hintss
Not quite processor cache effects, but Duff's Device is pretty cool too

