
Too much locality for stores to forward - signa11
https://pvk.ca/Blog/2020/02/01/too-much-locality-for-store-forwarding/
======
This is a great little article. I find too many programmers don't understand
the value of batches, at every level of code. Batching is super important,
both to amortize fixed costs and to reduce serialized blocking for "IO" (in
this case, defining IO as "operations which fetch data from a slower data
source than the one upon which we're operating").

There's another mistake programmers are prone to make after learning about the
importance of batching. If batches of ten are good, why not a hundred? Why not
a million? Lots of people make the mistake, after learning the value of
batches, of assuming that bigger is always better. This isn't the case either.
There are a few reasons why overly large batches can also cause problems. For
streaming work, large batches hurt the latency of initial results. Access
locality can also be harmed when batches are too big. If your cache can fit a
hundred of your objects and you perform operations batch-wise in batches of a
thousand, every pass over the batch has to fetch each object from main memory
again. If you worked in units of a hundred instead, you would fetch each
object from main memory only once.
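
A rough sketch of the cache-sized version (transform_a/transform_b are
hypothetical stand-ins for whatever per-object operations you chain):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical stand-ins for whatever per-object operations you chain.
    void transform_a(int &x) { x += 1; }
    void transform_b(int &x) { x *= 2; }

    // Work in chunks small enough to stay resident in cache, so each object is
    // pulled from main memory once per chunk instead of once per operation.
    void process(std::vector<int> &objects, std::size_t chunk = 100) {
        for (std::size_t i = 0; i < objects.size(); i += chunk) {
            const std::size_t end = std::min(objects.size(), i + chunk);
            for (std::size_t j = i; j < end; ++j) transform_a(objects[j]);
            for (std::size_t j = i; j < end; ++j) transform_b(objects[j]);
        }
    }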

I find there are a couple of good rules of thumb for batching. Batches should
be as small as possible, as long as they are large enough to solve the issues
I mentioned earlier. I like to target batch sizes that are large enough that
the fixed costs are no more than about 5% of the total costs of executing the
batch. In the case of CPU-local work like the one described in this article,
you would pick the batch size such that operation 0 and its associated fetches have
finished on batch element 0 before you loop back to it to begin work on
operation 1. In network situations where tracking batch completion and context
switching is less expensive, this doesn't make as much sense. Just make it so
that multiple batches can run concurrently, and issue multiple network
requests to handle each one.
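
To make that 5% target concrete: if a batch has fixed cost F and each element
costs c to process, the fixed share of a batch of n elements is F / (F + n*c),
so keeping it at or below 5% means roughly

    n >= 19 * F / c

(the factor 19 just being (1 - 0.05) / 0.05, and the same reasoning works for
whatever threshold you pick).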

~~~
magicalhippo
At work I hit this issue just the other week. Customers reported performance
issues. The issue was that they were pushing more data than we expected, but
then again we should have expected that...

Some intricate core code had to do something like 10-15 queries per invoice
line. Each query is quick, but it adds up when you have 10k lines. My coworker
had written the code to do this in a per-line way.

Now, since this code was called from other parts of our program, where it
might just operate on a single line, I couldn't just cache willy-nilly inside
the module. If I did, I'd have to introduce calls to invalidate the caches or
whatnot. Easy to miss, easy to screw up, and easy to end up with stale data.

However, if the API had been batch oriented from the start, it would still
work fine on a single line, but it would also allow for more optimization when
called with actual batches. The cases where it was called for a single line at
a time were when the user was punching lines in by hand, so being sub-optimal
there would not be an issue.
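
Something like this is what I mean (hypothetical names, just to show the
shape):

    #include <vector>

    struct InvoiceLine { int id = 0; };
    struct LineResult  { int value = 0; };

    // Batch-oriented entry point: fetch everything the whole batch needs up
    // front (e.g. one query per lookup kind instead of 10-15 queries per line),
    // then compute per line.
    std::vector<LineResult> process_lines(const std::vector<InvoiceLine>& lines) {
        // ... bulk lookups keyed by all lines would go here ...
        std::vector<LineResult> out;
        out.reserve(lines.size());
        for (const auto& line : lines) out.push_back({line.id});
        return out;
    }

    // The single-line case is just a batch of one, so interactive callers keep
    // working unchanged.
    LineResult process_line(const InvoiceLine& line) {
        return process_lines({line}).front();
    }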

Gonna be fun trying to disentangle this and make it batch friendly...

~~~
triztian
Seems like a good option would be to implement a second version that has the
batch API, but leave the line-oriented one as-is for the callers that expect
it to behave in a per-line way.

~~~
magicalhippo
Yeah, still won't be easy to integrate the batch optimizations now, since all
the code in this module is written with the single-line-at-a-time assumption.

~~~
spockz
It seems to me that this is the kind of area something like reactive streams
would excel at. The interface is streaming by default, and the code dealing
with the database could decide to buffer the incoming stream to an amount
optimal for building a query (of course only buffering until a maximum latency
is reached). This way neither the producer nor the consumer needs to be aware
of the exact buffer requirements, and the whole pipeline can be optimised
using back pressure.
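
The buffering policy itself is small even without a reactive library; a rough
single-threaded sketch (invented names), where the flush callback would be the
place to build one query for the whole batch:

    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    // Buffer items until either max_items accumulate or max_wait has elapsed
    // since the first buffered item, then hand the whole batch to flush()
    // (e.g. to build one database query per batch instead of one per item).
    template <typename T>
    class LatencyBoundedBuffer {
    public:
        using Clock = std::chrono::steady_clock;

        LatencyBoundedBuffer(std::size_t max_items, Clock::duration max_wait,
                             std::function<void(std::vector<T>&&)> flush)
            : max_items_(max_items), max_wait_(max_wait), flush_(std::move(flush)) {}

        void push(T item) {
            if (buf_.empty()) first_ = Clock::now();
            buf_.push_back(std::move(item));
            if (buf_.size() >= max_items_ || Clock::now() - first_ >= max_wait_)
                drain();
        }

        void drain() {  // also call this at end-of-stream
            if (buf_.empty()) return;
            flush_(std::move(buf_));
            buf_.clear();
        }

    private:
        std::size_t max_items_;
        Clock::duration max_wait_;
        std::function<void(std::vector<T>&&)> flush_;
        std::vector<T> buf_;
        Clock::time_point first_{};
    };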

~~~
magicalhippo
Would be interesting to try; sadly there's no library support for it in the
language we use.

------
ncmncm
Backtrace hires thoughtful people. (Disclosure: didn't hire me. :-) It seems
to come from the top: Sami al-Bahra is thoughtful.

Backtrace provides what was, to me, a surprising service: they aggregate crash dumps
and stack tracebacks, in bulk, from their customers' customers' failed
programs, so that the causative bug can be identified and fixed.

------
dragontamer
> In fact, perf showed that the query server as a whole was spending 4% of its
> CPU time on one instruction in that loop:

Looks like 40% to me actually; I think there's a typo here. EDIT: The tweet is
clearer: it's 4% of the total CPU time for the whole program. The 40% in that
line is only relative to that particular function (I presume the function is
taking ~10% of total CPU time, and 40% of that is on that single instruction).

--------------

> The first thing to note is that instruction-level profiling tends to put the
> blame on the instruction following the one that triggered a sampling
> interrupt.

Hmm, in my experience with AMD-based hardware profiling, AMD CPUs may place
the blame as far back as ~200 instructions (pathological worst-case) behind
the instruction that was interrupted.

This is most obvious with AMD branch mispredictions. A branch mispredict can,
by definition, only happen on branch instructions (a cmp/jmp pair, since cmp
and jmp are macro-fused into a single micro-op on both AMD and Intel). But
you'll find mispredictions "blamed" on all sorts of non-branching instructions
on AMD hardware. It's a known limitation of AMD's hardware performance
counters.

In the first case, the movdqu is almost certainly where the 40% of the time is
being spent... later on there is...

    
    
         9.91 |       lea        (%r12,%r12,4),%rax
         0.64 |       prefetcht0 (%rdx,%rax,8)
        17.04 |       cmp        %rcx,0x28(%rsp)
    

I can't imagine that the prefetcht0 instruction is taking up 17% of the time
of the function. I expect that the 17% here must be something else. In
particular, Agner Fog's manuals state that PREFETCHNTA/T0/T1/T2 have a latency
of 1 cycle and a throughput of 2 instructions per clock: very fast
instructions in Agner Fog's tests on both Intel Skylake and AMD Zen.

AMD has Instruction-Based Sampling (IBS), a difficult-to-understand but more
accurate profiling methodology. Intel has PEBS, Precise Event-Based Sampling,
which helps narrow down the specific instruction to "blame".

This blog post doesn't say which CPU it's using, but unfortunately these
little details do matter. Ultimately, I'd investigate more before concluding
that the prefetcht0 instruction is actually the problem.

If store forwarding might be an issue, then the LD_BLOCKS.STORE_FORWARD event
on Intel Skylake would be a performance counter to check.
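
For example, something along these lines (the event name should show up in
`perf list` on Skylake-era CPUs; the binary name here is just a placeholder):

    perf stat -e cycles,ld_blocks.store_forward ./your_query_server
    perf record -e ld_blocks.store_forward -- ./your_query_server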

--------------

> and accelerate random accesses to the hash table with software prefetching.

Does prefetching actually help in this case? Have you tested the code without
prefetching?

Without prefetching, a MODERN processor will NOT "stall" on the load
instruction. A modern processor will execute out-of-order, branch-predict the
next iteration (it's an inner loop, so it's probably always taken), and then
effectively achieve the same throughput anyway.

I wouldn't necessarily be "surprised" if prefetching helps in this case... but
it's not necessarily a given.
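
For what it's worth, the prefetch is easy to gate behind a compile-time flag
for an A/B test. A rough sketch (invented stand-ins for the real hash table
and probe logic; __builtin_prefetch is the GCC/Clang builtin):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    #ifndef USE_PREFETCH
    #define USE_PREFETCH 1  // build with -DUSE_PREFETCH=0 to compare
    #endif

    // Invented stand-in for the real hash function / probe sequence.
    std::size_t slot_of(std::uint64_t key, std::size_t capacity) {
        return (key * 0x9E3779B97F4A7C15ull) % capacity;
    }

    // Batch of 8: compute all slots (and optionally prefetch them) first,
    // then do the actual insertions once the lines are hopefully in cache.
    void insert_batch(std::vector<std::uint64_t> &table,
                      const std::uint64_t (&keys)[8]) {
        std::size_t slots[8];
        for (int i = 0; i < 8; ++i) {
            slots[i] = slot_of(keys[i], table.size());
    #if USE_PREFETCH
            __builtin_prefetch(&table[slots[i]], 1 /* prepare for write */);
    #endif
        }
        for (int i = 0; i < 8; ++i) {
            // Real code would probe and handle collisions here; simplified to
            // a plain store for the sketch.
            table[slots[i]] = keys[i];
        }
    }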

~~~
anarazel
> Without prefetching, a MODERN processor will NOT "stall" on the load
> instruction. A modern processor will execute out-of-order, branch-predict
> the next iteration (it's an inner loop, so it's probably always taken), and
> then effectively achieve the same throughput anyway.

It's a hash-table insertion. That's usually poorly predictable.

The post explicitly talks about out-of-order execution, and why it's of
limited help here. See the second "Hypothesis" paragraph. It then talks about
how to change the code to better utilize out-of-order execution.

~~~
dragontamer
> The post explicitly talks about out of order execution...

I understand that. But nowhere does the post demonstrate whether prefetching
actually was useful or not.

> It's a hash-table insertion. That's usually poorly predictable.

If he is always inserting 8 elements, the CPU might be able to figure out
that it needs to insert 8 elements in that loop.

The hash-table insertion itself will fail to be predicted. But there's nothing
the programmer can really do about that. However, by having a constant
8-element "insertion batch", you provide an easily predicted loop to the CPU.
There will almost always be 8 elements inserted into the hash table, and CPUs
can accurately predict constant-sized loops under 20 iterations or so... maybe
30 iterations depending on how recent your CPU is.

These 8x insertions WILL be predicted. That means your CPU will be OoO
executing these 8x insertions. Each insertion may be unpredictable from a
branch-prediction perspective (rehash 0x, 1x, or 2x? Depends on the hash
table, probably unpredictable), but that doesn't change the dependency graph,
so I expect the CPU to OoO the next iteration anyway.

--------------

In any case, I think prefetching is an open question still. I think this is a
good blog post, but I'm curious if the author did any prefetching experiments.

~~~
atq2119
> Each insertion may be unpredictable from a branch-prediction perspective
> (rehash 0x, 1x, or 2x? Depends on the hash table, probably unpredictable),
> but that doesn't change the dependency graph, so I expect the CPU to OoO the
> next iteration anyway.

This does matter, actually. As soon as the first branch inside an individual
item's insertion is mispredicted, all the following instructions will be
discarded and must be re-executed. The out-of-order logic in the CPU cannot
tell that there is a subset of inflight instructions that will be executed
regardless of the outcome of the branch.

The re-execution is likely cheaper though, because the earlier speculative
execution ensures that the relevant caches are already populated.

------
aratakareigen
I just started reading, but why does it say "(semi)group" instead of
"semigroup / monoid"? I've never heard of inversion being used to perform a
fold.

