> The dispatch loop takes a single indirect branch to the opcode-specific implementation. This means that the branch will be nigh unpredictable!

Modern branch predictors can actually predict indirect branches with multiple destinations, because they hash recent branch history into the prediction. The exact same indirect branch will end up with multiple BTB entries, based on previous control flow.
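
For concreteness, here's a minimal sketch of the "common dispatcher" shape being discussed (not the article's actual code; opcodes and names are made up). Every handler funnels back through the same switch, which typically compiles to one jump-table load plus one shared indirect branch, so the only way the predictor can tell "add is usually followed by push" apart from "add is usually followed by jmp" is by folding recent branch history into the lookup.

    /* Hypothetical stack VM with a single shared dispatch point. */
    enum { OP_PUSH, OP_ADD, OP_JMP, OP_HALT };

    int run(const unsigned char *code) {
        int stack[256], sp = 0;
        const unsigned char *ip = code;
        for (;;) {
            switch (*ip++) {                              /* the one indirect branch */
            case OP_PUSH: stack[sp++] = *ip++;            break;
            case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
            case OP_JMP:  ip = code + *ip;                break;
            case OP_HALT: return sp ? stack[sp-1] : 0;
            }
        }
    }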

I was curious where the threaded-code speedup is actually coming from. It's possible much of it comes from simply eliminating the extra branch back to the dispatch loop. Or maybe the history tracking in the M1's branch predictor doesn't work well for this type of control flow.

So I checked out this code and modified the next macro, adding a dummy branch over a nop, to roughly isolate this factor.
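
The shape of that change might look something like this, assuming a computed-goto style dispatch macro (GNU C extension) and AArch64 inline asm; names are illustrative and the actual macro in the repo may well differ.

    /* Sketch of a threaded "next" macro: each handler ends by jumping
       straight to the next handler. */
    #define NEXT() goto *handlers[*ip++]

    /* Dummy-branch variant for the experiment: take an unconditional
       branch over a nop first, mimicking the extra branch a common
       dispatcher would add, then dispatch as before. */
    #define NEXT_DUMMY()                            \
        do {                                        \
            __asm__ volatile("b 1f\n\tnop\n1:");    \
            goto *handlers[*ip++];                  \
        } while (0)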

On my M1, the extra unconditional branch benchmarks at 1.08 sec for fib, and 1.19 sec for Mandelbrot (all other times were the same).

Looks like the speedup is a mixture of the two: eliminating that extra branch is responsible for about 20-30% of it, and improving prediction on indirect branches for the other 70-80%.




Ok, I spent quite a bit of time looking at performance counters, trying to understand what the M1's branch predictor was doing.

The branch predictor is really accurate with a common dispatcher: it predicts those indirect branches correctly 99.25% of the time. Switching to threaded jumps improves this slightly to 99.75%, but not because the indirect branches are at different addresses. The improvement in accuracy is entirely because the unconditional branch back to the dispatcher was removed as a side effect. With that gone, each VM instruction contributes one taken branch to the history instead of two, so the branch predictor can now track branch histories that are about twice as many VM instructions long.

My modified version with a dummy branch in the threaded dispatcher negates this longer history and (according to the performance counters) results in the exact same 99.25% correctly predicted branches as the common dispatcher, yet it's still significantly faster, only 20% slower than threaded jumps.

-------------------

Why are threaded jumps faster on the M1, if they're not increasing branch prediction accuracy?

Well, the M1 essentially has two branch predictors[1]. The faster one can return a prediction in one cycle, but it doesn't check branch history at all, and it's almost always wrong for these unpredictable indirect branches. The slower predictor does take branch history into account, but takes three cycles to produce the correct result.

Which means there is a short pipeline stall when the second prediction comes in. This stall doesn't show up as a BRANCH_INDIR_MISPRED_NONSPEC, because the branch was correctly predicted. Instead, it seems to show up in the FETCH_RESTART counter.

So while threaded jumps don't improve overall branch prediction accuracy (except via the longer history), they do improve the accuracy of the faster branch predictor. With a common dispatcher, it predicts wrong almost 100% of the time, but with threaded code its accuracy improves to 40%.
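
A back-of-envelope estimate using those numbers, and assuming that a fast-predictor miss which the slow predictor later corrects costs roughly the two-cycle latency difference between the two predictors:

    expected resteer cost per dispatch ~ P(fast prediction wrong) x (3 - 1) cycles
        common dispatcher: ~100% wrong x 2 cycles ~ 2.0 cycles per dispatch
        threaded jumps:     ~60% wrong x 2 cycles ~ 1.2 cycles per dispatch

Which would explain a measurable speedup even though the overall prediction accuracy barely moves.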

[1] Or at least that's a decent mental model. I suspect it's actually a single ITTAGE branch predictor that returns the zero-history result in one cycle.


Very cool, thanks for digging into this!


The other thing I noticed while playing around: performance absolutely falls off a cliff if your hot loop starts missing in the L1i cache.

This blog post [1] from Cloudflare has a few interesting hints about the M1's branch predictor. First, in their testing, to get the one-cycle predictions at all, your hot code needs to fit in just 4KB.

Second, if the branch target isn't in the L1 cache, you don't get a prediction at all. The branch target prediction probably points directly at the cache line + way, so even if a cache line moves to a different way, the prediction will still fail.

Which means I'm not sure this optimisation is worthwhile. It works for Fibonacci and Mandelbrot because they have reasonably tight loops, and adding a dispatcher to each instruction handler doesn't push the hot code over the limit. But when interpreting more generic code, you might be better off trying to minimise cache usage.

[1] https://blog.cloudflare.com/branch-predictor



