
Alternatives to short unconditional jumps on x86 - jsnell
http://www.corsix.org/content/lxi-trick
======
innocenat
Are there problems with CPU cache management? IIRC Intel Sandy Bridge and later
generations have a micro-op cache (which sits after the decoder). Wouldn't this
and other instruction-level polymorphism tricks severely mess up such a cache?

~~~
Cyph0n
The issue with jumps/branches during instruction fetch is that _the CPU doesn't
know with 100% certainty whether an instruction is a branch or not_. OK, how is
this an issue? Well, the instruction prefetcher will end up fetching useless
instructions!

So "modern" (1980s+) CPUs add a branch predictor[1] to the fetch stage. The
job of the branch predictor is to determine two things:

(1) Is this instruction a branch? [2]

(2) If it's a branch, is it taken or not?

For the case of unconditional jumps, we can just focus on (1). Therefore,
given that the branch predictor is never 100% accurate, even an unconditional
branch can result in garbage instructions fetched into the CPU pipeline. So
replacing some key unconditional jumps might in fact improve performance.

[1]: Actually, I was lying. There is also the BTB, collapsing buffer, and
trace cache.

[2]: The fetch stage doesn't know what an instruction is; that's the job of
the decode stage.
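
As a rough illustration of point (1), here is a toy sketch (my own construction, not from the thread or any real microarchitecture) of why a branch target buffer, mentioned in footnote [1], lets the fetch stage guess "this is a branch" before decode: it is just a table keyed by fetch address, filled in after a branch has actually executed once.

```python
# Toy BTB sketch: at fetch time the CPU only has an address, so the only way
# to "know" an instruction is a branch before decoding it is to have seen it
# branch before. Addresses and the 4-byte fetch granule are arbitrary choices.

class ToyBTB:
    def __init__(self):
        self.entries = {}  # fetch address -> last observed branch target

    def record(self, fetch_addr, target):
        # Called when decode/execute confirms "that was a branch to `target`".
        self.entries[fetch_addr] = target

    def predict(self, fetch_addr):
        # Returns (is_branch_guess, predicted_next_fetch_addr).
        if fetch_addr in self.entries:
            return True, self.entries[fetch_addr]
        # Never seen before: guess "not a branch", fetch falls through.
        return False, fetch_addr + 4

btb = ToyBTB()
first = btb.predict(0x1000)   # (False, 0x1004): first sight, fall-through guess
btb.record(0x1000, 0x2000)    # execute stage reports a jump to 0x2000
second = btb.predict(0x1000)  # (True, 0x2000): now predicted at fetch time
```

Note that even an unconditional `jmp` mispredicts on its first encounter (the `first` case above), which is exactly why removing a jump entirely can beat having a perfectly predictable one.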

~~~
innocenat
I understand your point. However, branch prediction is so good nowadays that
it rarely matters, especially for short branches [1].

My main concern about the trace cache/micro-op buffer is how it would act when
it encounters an instruction boundary change. Would it get flushed? Decoding
can be a bottleneck for some workloads (e.g. SIMD integer).

[1]:
[http://yarchive.net/comp/linux/cmov.html](http://yarchive.net/comp/linux/cmov.html)

~~~
anarazel
Branch prediction has gotten a lot better, that's true. Especially by taking
more preceding branches into account.

But saying that branch mispredictions rarely matter seems a _way_ too strong
statement. In a lot of halfway performance-critical software a large number of
pipeline stalls are caused by wrongly predicted branches, and utilizing the
pipeline efficiently is IMO becoming more, not less, important.

------
rwmj
It would be nice to know when it's appropriate to use these. An unconditional
jump has no pipeline penalty and doesn't require speculative execution. These
alternatives save a single byte per if statement (making code size a tiny bit
smaller, and slightly reducing pressure on the I-cache).

Is there a risk of introducing a false dependency? I think the answer is no,
because if the following code depends on a flag value or the exact contents of
%rax then it's probably incorrect.

Maybe modern processors wouldn't like seeing the same instruction bytes
decoding to two different micro-ops? That might trigger some horrible slow
fallback path.

~~~
mikeash
It will probably confuse disassemblers and debuggers. Which might be a bonus,
depending on what you're after.

~~~
corsix
IDA certainly complains a lot when it sees this.

------
userbinator
Techniques like these were extremely common on the NES and other game consoles
of the time, where packing as much code as possible into the ROM (fixed-size
and expensive to expand) was important. Embedded systems of the same era, and
even PC software, would often contain them too.

~~~
seeekr
So are these techniques only about reducing code size and therefore better
utilization of instruction cache? Or is there some penalty to unconditional
jumps that may be avoided using such techniques?

------
TwoBit
I don't understand what the author is talking about. He's writing as if people
are familiar with this already or can read his mind (the #1 failure of all
technical writing). I say this as somebody who understands x86 assembly. What
is the following trying to convey?

"For example, jmp $+1 encodes as EB 01 ?? where ?? is the one byte to be
jumped over. If burning a register is an option, then mov al, imm8 (encoded as
B0 ??) might be an alternative (that is, the byte being jumped over becomes
the imm8 value)."

~~~
tedunangst
Yeah, it's a little opaque. I'll try to rephrase. You have some code, with an
if and else. if (cond) X else Y. After that comes code Z. When you compile
this, the compiler emits instructions to check the condition, and if it's
zero, jump to the else. Otherwise it proceeds to run X. At the end of X, you
need to add another jump over Y to Z. Otherwise you'd execute Y too.

Having lots of little jumps in your code can slow things down (well, maybe,
depending on CPU, etc., but stipulate that jumps are bad). So at the end of X,
you have to have a jump over the length of Y. However, if Y is very short,
like 1 or 4 bytes, you don't actually need to use a jump instruction. Instead,
pretend the Y instruction is a constant, and load it into a register (and
ignore it). The result is the CPU runs X, loads some ignored data into a
register, then runs Z. No jumps.
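
To make that concrete, here is a hypothetical sketch in NASM-style syntax (the one-byte `pop rcx` standing in for Y, and the labels, are my own choices, not from the article):

```nasm
; Conventional form: skip the one-byte else-body Y with a 2-byte short jump.
        ; ... code X ...
        jmp .after          ; EB 01 - two bytes spent jumping over one byte
.else:  pop rcx             ; 59 - code Y, reached only by branching to .else
.after: ; ... code Z ...

; Trick form: if clobbering AL is acceptable, consume Y's byte as the
; immediate of `mov al, imm8` (opcode B0) instead of jumping over it.
        ; ... code X ...
        db 0B0h             ; mov al, imm8 - falling through from X, the 59h
.else2: pop rcx             ;   byte of `pop rcx` becomes the imm8, so X's
                            ;   path just loads AL = 59h and continues to Z
        ; ... code Z ...    ; branching to .else2 still executes `pop rcx`
```

The same instruction bytes thus decode two different ways depending on where execution enters, which is the "instruction-level polymorphism" the other subthread worries about; the payoff is one byte saved versus the short jump.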

------
corsix
Thanks for the hug of death - I've turned on more aggressive CloudFlare
caching.

~~~
mikeash
It took forever to load the first time I tried it, but it's fast now.

