For anybody unfamiliar with this, as I was, this appears to refer to Intel's Indirect Branch Tracking feature[1] (and the equivalent on ARM, BTI). The idea is that an indirect branch can only pass control to a location that starts with an "end branch" instruction. An indirect branch is one that jumps to a location whose value is loaded or computed from either a register or memory address: think calling a function pointer in C.
Without IBT, you'd have this equivalence between C and assembly:
void foo(void);
int main(void) {
    void (*f)(void);
    f = foo;
    f();
}
void foo(void) { }
---
main:
movl $foo, %edx
call *%edx
ret
foo:
ret
If IBT is enabled, the above code triggers an exception because foo doesn't begin with an "end branch" instruction. When IBT is enabled by the compiler, the above code gets assembled as:
main:
endbr64
movl $foo, %edx
call *%edx
ret
foo:
endbr64
ret
Now the compiler inserts endbr64 at the start of each function prologue. The reason for this feature is to serve as defense in depth against JOP and COP attacks: it means that the only "gadgets" available to an attacker are entire functions, which can be far harder to exploit and chain.
There's a good question in the comments there that I still don't see the answer to. How does this work if there's an interrupt between the branch and the endbranch? Does the OS need to save/restore the "branchness" bit?
I'd assume so, since an interrupt return isn't a call/jmp through a computed address in a register. That said, I haven't read the documentation for any of this. But interrupts involve a stack pointer change and other state changes that make them different, which is why they return with the IRET instruction and not RET.
Various architectures do other interesting things with NOPs, IIRC one convention on PowerPC had something vaguely related to debugging or tracing (I can't remember the details or find any references right now).
Not just architectures: different OSes and ABIs have found ways to repurpose no-ops too. One example[1] is Windows using the 2-byte "MOV EDI, EDI" as a hot-patch point: it gets replaced by a "JMP $-5" instruction, which jumps to a point 5 bytes before the start of the function, into a region reserved for patching. Those 5 bytes are enough to contain a full jump instruction that can then jump wherever you need it to.
## Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?
Interesting, thanks for pointing this out! Just yesterday I was gazing at some program containing two consecutive xor rax, rax. I thought what’s the point? But as you point out it might be a NOP sled designed to be that specific length.
That would be surprising. xor is often used like that to set a register to 0, which is far from a nop. I'm not sure why it would do it twice, but it might be as simple as the compiler being stupid.
The fact that it’s xor rax, rax rather than xor eax, eax is also interesting as it’s one byte longer for exactly the same effect (modifying the bottom 32 bits of a register clears the upper 32 bits). It makes me think there’s something weird going on other than compiler stupidity. I’d be interested in seeing the code it was compiled from.
I wonder if this is still true. Whenever I go to hook Win32 API functions, I use an off-the-shelf length disassembler to create a trampoline with the first n bytes of instructions and a jmp back, and then just patch in a jmp to my hook, but if this hot-patch point exists it'd be a lot less painful since you can avoid basically all of that.
Though, I guess even if it was, it'd be silly to rely on it even on x86 only. Maybe it would still make for a nice fast-path? Dunno.
Intel Vtune will do this with 5-byte NOPs directly. I think LLVM's x-ray tracing suite did this with a much bigger NOP, also, to capture more information.
I was wondering where I'd read about PowerPC, and this is exactly the article! So, it was for thread priority. It strikes me as an odd design choice; this probably should have been something managed more explicitly by the OS.
I think the idea of exposing it to user space is to better handle concurrency before trapping into the kernel.
So consider the case of a standard mutex in the contended case. Normally the code will spin for a little bit before informing the kernel scheduler on the off chance that the thread that owns the lock is currently scheduled on another hardware thread. In that case it's in the best interest of the thread trying to grab the lock to shift most of the intracore priority to any other hardware threads so that it can potentially help the other hardware thread holding the lock get to a point where it gives up the lock quicker.
It was an old joke that the opposite of "goto" is "come from", or that if goto is considered harmful, nobody said anything about a "come from". Marking something as a branch target reminds me of this.
Interesting. Seems like enforcement on Intel CPUs is supported since Tiger Lake (so ~2020). Windows has basically the same feature implemented in software since 2015, called Control Flow Guard [1]. I wonder what the story there is, and if Windows has any plans to (get everyone to) switch to the hardware version once those CPUs have sufficient market share.
Windows also recently implemented a far better version of this called Extended Flow Guard (XFG) that not only checks whether the location is a valid destination, but also whether it's a valid destination for that specific source.
For example, for any virtual function call or function pointer call, the destination must have a correct tag with the hash of the arguments. It's much more secure, and also faster, since loading the tag from memory can be merged with loading the actual code after it.
There’s a great article on XFG here [0], but it observed that a failed XFG check downgrades to a regular CFG check instead of a denial... meaning it adds zero extra protection? Perhaps this behavior has changed since the preview they tested, though!
That can't be right, it would be entirely pointless then. It looks like the article was written during a pre-release time, so maybe it wasn't fully enabled?
I've not yet been able to use XFG in any production software, due to the requirement of rebuilding every statically linked library with it enabled. But it didn't seem to fall back to CFG when I was testing it in a toy program.
That does sound more robust, but it would definitely require a lot more silicon than the IBT they did implement. Something like it might come in a future revision.
It looks like endbr64 is a 4-byte instruction. That could be a significant code size overhead for jump tables with lots of targets: https://godbolt.org/z/xTPToaddh
OpenBSD disables jump tables in Clang on amd64 due to IBT, some architectures also had jump tables disabled as part of the switch to --execute-only ("xonly") binaries by default, e.g: powerpc64/sparc64/hppa.
As to why they're not always called directly, imagine some code like this:
int FooWithoutChecks(void *p);
int Foo(void *p) {
if (p == NULL) return -1;
return FooWithoutChecks(p);
}
In general the caller is expected to call Foo if they aren't sure whether the pointer might be null; if they already know the pointer is not null (e.g. because they checked it themselves), they can call FooWithoutChecks and avoid a null check that they know will never be true.
The naive way to emit assembly for this is to actually emit two separate functions and have Foo call FooWithoutChecks the usual way. But notice that the call to FooWithoutChecks is a tail call, so the compiler can apply tail call optimization: it emits Foo with the body of FooWithoutChecks laid out immediately after the pointer check, so the inner logic is effectively inlined into Foo. This is nice because calls to Foo now avoid a call/ret pair, saving two instructions on every call. But what if someone calls FooWithoutChecks? Simple: that symbol just points at the offset into Foo's body past the pointer comparison. This works because Foo already ends in a ret instruction, so a call to FooWithoutChecks reuses the existing ret. The optimization also saves some space in the binary, which has various benefits in and of itself.
The example here with the null pointer check is kind of contrived, but this kind of pattern happens a LOT in real code when you have a small wrapper function that does a tail call to another function, and isn't specific to pointer checks.
> Why should every function start with endbr64 command? Aren't functions usually called directly?
They're usually called directly, but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required.
> Also, is it required to insert endbr64 command after function calls (for return address)?
No, IBT is only for indirect jmp and call. The shadow stack (CET-SS) is the equivalent mechanism for ret.
> but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required
Then why not just have the compiler break down every non-static function into two blocks: a static function that contains all the logic, and a non-static function that just contains an IBT and a direct jump to the static function? (Or, better yet, place the non-static label just before the static one, and have the non-static fall through into the body of the static.) Then the static direct callsites won't have to pay the overhead of executing the IBT NOP.
The IBT NOP is "free" in that it will evaporate in the pipeline; it still has to be fetched and decoded to some extent, but it does not consume execution resources.
From a tooling perspective, what you're describing (two entrypoints for a function, the jump you mention is pointless) would require changes up and down the toolchain; it would affect the compiler, all linkers, all debuggers, etc. By contrast, just adding an additional instruction to the function prolog is relatively low-impact.
It's also worth noting that at the time code for a function is emitted, the compiler is not aware of whether the symbol will be exported and thus discoverable in some other module, or by symbol table lookup, so emitting the target instruction is essentially mandatory.
Doesn't seem like it'd be that difficult to make the change the other direction, i.e. keep endbr64 as-is as the default case, but if there's a direct jump/call to anywhere that starts with endbr64, offset the immediate by 4 bytes; could be done in any single stage of toolchain that has that info with no extra help. But yeah, quite low impact, might not even affect decode throughput & cache usage for at least one of the direct or indirect cases.
That's absolutely doable, just... How much slower or faster is a predicted unconditional jump than ENDBR64? What's the ratio of virtual/static calls in real-world programs? And while your last proposal ("foo: endbr64; foo_internal: <code>") evades those questions, it raises questions about maintaining function alignment (16 bytes IIRC? Is this even necessary today?) and restructuring the compiler to distinguish the inner/external symbol addresses. Plus, of course, somebody has to actually sit down and write the code to implement that, as opposed to just adding "if (func->is_escaping) emit_endbr(...);" at the beginning of the code that emits the object code for a function body.
It's not "executed" per se. It consumes space in the cache hierarchy, and a slot in the front-end decoder. It won't ever be issued, but depending on the microarchitecture in question it might result in an issue cycle having less occupancy than it might have had in the case where the subsequent instruction was available.
With that said, the first few instructions of a called function often stall due to stack pointer dependencies, etc. so the true execution cost is likely to be even smaller than the above might suggest.
C allows for any function to be called via a function pointer, and functions can be in different translation units, so the compiler can't simply assume that a function will never be called indirectly and has to pessimistically insert endbr64 in order to maintain a reasonable ABI.
And no, as I understand it, this is only for branch/calls not returns.
Well, if the function is marked "static", the compiler can actually check whether the function's address is taken in the current compilation unit and omit/emit ENDBR64 accordingly (passing pointers to static functions to code in other compilation units is legal, and should still work).
Good catch. Yeah, as long as the function's address is never taken, the compiler has a lot of leeway with static functions; it can even avoid emitting code for them entirely if it can prove they're never called, or if it's able to compute their results at compile time.
[1]: https://www.intel.com/content/dam/develop/external/us/en/doc...