
Hardware Store Elimination - matt_d
https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html
======
bitwize
Isn't hardware store elimination when you order all your lumber, sheet metal,
tools, etc. online and have them shipped straight to your door?

~~~
BeeOnRope
In retrospect, this was quite a poor name. I wouldn't want anyone to think
their Home Depot will be taken away from them...

Perhaps someone can rename this to "Skylake CPU Store Elimination".

------
shaklee3
Excellent article, as usual, Travis.

------
ylk1
Nice read!

~~~
ylk1
Can you try microbenchmarking Apple cores? Nobody seems to be doing that,
yet people claim they are the best in the business.

~~~
BeeOnRope
I don't have any Apple cores easily available, but the code [1] is open if
anyone wants to try it (I don't know how POSIX-y the iOS compile environment
is, though).

One caveat is that just because you don't find a performance difference,
doesn't mean the optimization isn't happening. It could simply be the case
that write throughput is not the limiter, but rather the latency * occupancy
product is the limiter. E.g., if it takes 50 ns to go from L2 to RAM, and
there are only 10 buffers available to hold these requests, then the maximum
bandwidth is 64 bytes / 50 nanos * 10 buffers = 12.8 GB/s regardless of the
maximum possible bandwidth of each component.
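
As a rough sketch of that arithmetic (the function and parameter names here
are made up for illustration and are not from the benchmark code; the 64-byte
line size, 50 ns latency and 10 buffers are just the example numbers above):

```python
def max_bandwidth_gb_per_s(line_bytes: int, latency_ns: float, buffers: int) -> float:
    """Upper bound on bandwidth when only `buffers` requests can be in flight.

    Each buffer completes one `line_bytes` transfer every `latency_ns`, so the
    bound is buffers * line_bytes / latency. Bytes per nanosecond is
    numerically the same as GB/s (1e9 bytes per 1e9 ns).
    """
    return buffers * line_bytes / latency_ns


# Hypothetical numbers from the example: 64-byte lines, 50 ns, 10 buffers.
bound = max_bandwidth_gb_per_s(line_bytes=64, latency_ns=50, buffers=10)
print(f"{bound:.1f} GB/s")  # prints "12.8 GB/s"
```

Note that the bound depends only on latency and the number of in-flight
requests, not on how fast any individual component could go.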

An eliminated store may still take this full latency (since it still has to
read the value from RAM), so even if all writes are eliminated, the
performance may remain at 12.8 GB/s - but you would save power and memory and
L3 bandwidth for other cores... but the optimization would be tough to detect
by looking at performance alone. You'd need to look at performance counters
(do those exist for iOS devices?).

[1] [https://github.com/travisdowns/zero-fill-bench](https://github.com/travisdowns/zero-fill-bench)

