
Why Intel Added Cache Partitioning - dangerman
http://danluu.com/intel-cat/
======
tkinom
16 years ago, I created a profiling macro system for a select set of critical
functions. The profiling system could switch the measured counter among
#instruction, #cpu_clk, #branch_stall_cycles, #L1_miss, and #L2_miss for all
the critical-path functions.

After analyzing the data, I found that branch stall cycles and data access
stall cycles were causing a huge number of delays in the critical code path.

I used the following tricks to get rid of the stall cycles (a rough sketch of
all three follows the list):

1) Use branch_likely hints to force gcc to keep the critical execution path
entirely branch-free. (This saves 30+ CPU cycles per branch; there are a lot
of branch stall cycles if one simply follows the gcc-generated "optimized"
code. MIPS CPU, 200MHz.)

2) Issue prefetches ahead of data structure accesses to get rid of
uncached-data delays. (This saves ~50+ CPU cycles per data stall, and there
are a lot of them in the critical path.)

3) Use inline functions, etc., to get rid of call stalls in the critical path.
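
A minimal sketch of the three tricks in C, assuming GCC builtins; the
likely()/unlikely() macros, node type, and loop are illustrative, not the
actual production code:

```c
#include <stddef.h>

/* 1) Branch hints: tell gcc which way each branch almost always goes. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

struct node {
    int key;
    struct node *next;
};

/* 3) static inline removes the call/return overhead on the hot path. */
static inline int process(const struct node *n)
{
    return n->key * 2;
}

int walk(struct node *n)
{
    int sum = 0;
    while (likely(n != NULL)) {       /* 1) hint the common case */
        /* 2) Start pulling the next node into cache while we compute
         * on the current one; prefetches never fault, so a NULL
         * argument here is harmless. */
        __builtin_prefetch(n->next);
        sum += process(n);
        n = n->next;
    }
    return sum;
}
```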

The system got a ~100x increase in overall throughput from those techniques,
with pure C optimization on top of a standard -O2 build.

I think it might be possible to create a build system that automatically
collects the profiling data (branch stall cycles and data stall cycles) and
uses branch_likely and prefetch instructions to auto-optimize the critical-path
code.

Specifying which code paths / function call sequences are the real critical
path would probably still require a programmer's touch.

As a result of placing the data prefetches properly, I didn't need cache
locking or any kind of CPU affinity trick to generate optimized object code
with no stall cycles in the critical code path.

~~~
xavierd
A lot of those optimizations would no longer yield any benefits[0]. CPU
architecture has evolved a lot in 16 years, especially in branch prediction,
to the point where a correctly predicted branch (even without branch_likely)
has almost no cost.

[0]: At least, this is true for x86 CPUs.

~~~
e5f34f89
As a CPU architect, I can confirm that none of those except possibly 2) will
yield significant benefits today. Prefetch hints will only be useful when the
particular code fragment is highly memory-bound, because most wide superscalar
microarchitectures will easily hide L1/L2 miss latencies.

~~~
fanf2
My qp trie code ([http://dotat.at/prog/qp/](http://dotat.at/prog/qp/)) got a
performance boost of about 20% from adding prefetch hints in the obvious
places. The inner loop is fetch / compute / fetch / compute, chaining down
into the trie. The next fetch will (usually) be at some small offset from a
pointer we can get immediately after the preceding fetch, so prefetch the base
pointer, then do the computation to work out the exact offset.

------
titzer
This also helps prevent cache side-channel attacks like the one that led to
stealing RSA private keys by probing the L3 miss times.

[https://eprint.iacr.org/2015/898.pdf](https://eprint.iacr.org/2015/898.pdf)

~~~
nickpsecurity
You beat me to it. Partitioned caches have been explored for years as a way to
defeat covert timing channels. Other measures were used decades ago with
varying success. I was excited at the title, thinking Intel was finally on
some bandwagon for suppressing timing channels given the cloud market,
security concerns, etc.

Another performance enhancement lol. That's good, too, but damnit I was hoping
they'd mention timing channels. It will have to be thoroughly scrutinized
before being relied on for that, but it's a start for sure. Hopefully, more
than that. :)

------
jharsman
I find it really weird that the article says:

> It’s curious that we have low cache hit rates, a lot of time stalled on
> cache/memory, and low bandwidth utilization.

That's typical of most workloads! In my experience, software is almost never
compute- or bandwidth-bound; instead it spends most of its time waiting on
memory in pointer-chasing code. This is especially true of code written in
managed languages like Java (since everything is typically boxed and allocated
all over the heap).

~~~
paulsutter
It's curious that there's low bandwidth utilization despite the low hit rates
/ the CPU stalled waiting for memory. Don't you think?

Perhaps random access of small data causes frequent waits without utilizing
the bandwidth in a way that block copies would.
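
A toy contrast (not a rigorous benchmark) makes the difference concrete: the
pointer chase below issues one dependent load at a time, so the core waits
while the memory bus sits nearly idle, whereas a block copy streams whole
cache lines and can saturate bandwidth:

```c
#include <string.h>

/* One 64-byte cache line per node (on a typical 64-bit machine). */
struct link { struct link *next; char pad[56]; };

/* Latency-bound: each load depends on the previous one, so only one
 * cache miss is outstanding at a time. */
struct link *chase(struct link *p, long steps)
{
    while (steps--)
        p = p->next;
    return p;
}

/* Bandwidth-bound: independent sequential accesses let the hardware
 * prefetchers and memory controller run at full rate. */
void stream(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
}
```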

~~~
jharsman
You don't get high bandwidth utilization by pointer chasing unless you have
many threads doing it and you switch threads while waiting on memory. That's
true for GPUs, not for typical server workloads running on CPUs.

------
dunkelheit
To me, putting a high-priority, low-latency process and a low-priority,
high-throughput process on the same machine is a recipe for disaster: they
will find _some_ way to get entangled and hurt each other. Kudos to Google for
pulling this off, but the article shows that you need expertise at all levels
of the stack to attempt it. Most of us are better off simply buying more
boxes :)

~~~
kozyraki
Or we can automate this in an OS/cluster manager and safely share it with
everyone, like we do with so many other technologies that are difficult for
most of us to develop :)

~~~
dunkelheit
One can dream... Of course, having this available as an open-source technology
would be great, but for now it is a piece of proprietary tech inside Google.

------
hapless
The SPECint and SPECfp benchmark suites were never very tightly connected to
real-world performance, but I wonder what these figures look like for SPECjbb,
the Java benchmark.

(If I recall correctly, SPECjbb spins up an app that looks a lot like "java
pet store" and runs clients against it.)

~~~
Symmetry
What I've heard is that you should mostly pay attention to the gcc compilation
part of SPEC if you want a good indication of real world performance.

~~~
ofrobots
Never use a single workload as a predictor of performance of the universe of
all other workloads. The best benchmark is not a benchmark at all – it is your
actual workload.

------
uxcn
Regarding cpusets, I wonder how much contention is created just by software
assuming the number of CPUs in the machine matches the number of CPUs in its
affinity set. For example, C++ has _hardware_concurrency_ now, Python has
_cpu_count_, Java has fork/join, etc...
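
On Linux the two numbers are easy to compare; a minimal sketch (assuming
glibc) of the total online CPU count versus the affinity mask those APIs often
ignore:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* What most "CPU count" APIs report: every online CPU in the box. */
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    /* What this process is actually allowed to run on (e.g. its cpuset). */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        printf("online: %ld, affinity set: %d\n", online, CPU_COUNT(&mask));

    return 0;
}
```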

~~~
thrownaway2424
Golang also has a totally inaccurate routine for counting the CPUs in the
machine.

~~~
nulltype
What's the accurate way?

~~~
uxcn
To calculate the number of processors available or the optimal number of
threads/processes?

~~~
nulltype
Well specifically it would be cool if it counted the optimal number to set
GOMAXPROCS to, since I think that's like the main use of runtime.NumCPU().

~~~
thrownaway2424
Right. Currently runtime.NumCPU tries to be fancy by looking at the population
count of the cpuset mask[1]. However, in a hosted environment using
containers, there's no reason to believe that the cpuset will remain fixed
over the life of the process. This can undercount the available CPUs, leaving
you with a GOMAXPROCS that is too low.

1:
[https://code.google.com/p/go/source/browse/src/pkg/runtime/t...](https://code.google.com/p/go/source/browse/src/pkg/runtime/thread_linux.c?r=ef1158a7371796bf4823a1ce43e3d01d2a765e14#85)

~~~
vessenes
Anecdotally, it's very often not bad (and in fact sometimes "good") to
over-provision GOMAXPROCS. We have used as much as 3 to 6x the number of
hyperthreaded cores with good results, depending on the workload. This can
insulate you against some container changes.

------
thrownaway2424
Summary: because it makes your computer faster.

~~~
thrownaway2424
Geez, tough crowd here. How about this:

Summary: because Google asked for it.

~~~
Retr0spectrum
Keep digging.

------
amelius
> Xkcd estimated that Google owns 2 million machines

Xkcd doesn't sound like a very convincing source. And at the very least, this
estimate should have some error bars.

~~~
onion2k
Of all the sources outside of Google, I'd say XKCD is likely to be closer than
most. Even if you ignore the fact that Randall (of XKCD) spends a good amount
of his time estimating things in a rigorous, scientific way, which means his
estimates are actually pretty good, he has exactly the background that means
people at Google would take his call if he picked up the phone to muse on the
question of how many machines Google has. They might not give specifics, but
helping would be a pretty cool thing to do, so they'd probably do it.

~~~
keeperofdakeys
Such a situation did happen: Randall answered how many punch cards it would
take to store all of the data in Google's datacenters
([https://what-if.xkcd.com/63/](https://what-if.xkcd.com/63/)). In response,
Google sent him punch cards
([http://blog.ted.com/using-serious-math-to-answer-weird-quest...](http://blog.ted.com/using-serious-math-to-answer-weird-questions-randall-munroe-at-ted2014/)).

