
For anybody unfamiliar with this, as I was, this appears to refer to Intel's Indirect Branch Tracking feature[1] (and the equivalent on ARM, BTI). The idea is that an indirect branch can only pass control to a location that starts with an "end branch" instruction. An indirect branch is one that jumps to a location whose value is loaded or computed from either a register or memory address: think calling a function pointer in C.

Without IBT, you'd have this equivalence between C and assembly:

    void foo(void);

    int main(void) {
        void (*f)(void);
        f = foo;
        f();
    }

    void foo(void) { }

    ---

    main:
        movl $foo, %edx
        call *%edx
        ret

    foo:
        ret
If IBT is enforced, the above code triggers an exception, because foo doesn't begin with an "end branch" instruction. When the compiler emits IBT support, the code gets assembled as:

    main:
        endbr64 
        movl $foo, %edx
        call *%edx
        ret

    foo:
        endbr64
        ret
Now the compiler inserts endbr64 at the start of each function prologue. The reason for this feature is to serve as a defense in depth against JOP and COP attacks: the only "gadgets" available to an attacker are entire functions, which can be far harder to exploit and chain.

[1]: https://www.intel.com/content/dam/develop/external/us/en/doc...



The fun fact is that older CPUs decode ENDBR64 as a slightly weird NOP (with no architectural effects), though it'll fault on original Pentiums: https://stackoverflow.com/questions/56120231/how-do-old-cpus...


There's a good question in the comments there that I still don't see the answer to. How does this work if there's an interrupt between the branch and the endbranch? Does the OS need to save/restore the "branchness" bit?


Yes, on ARM the branch type is saved in SPSR_EL1, in the BTYPE field. SPSR_EL1 stands for Saved Program Status Register (Exception Level 1), and BTYPE for Branch Type. https://developer.arm.com/documentation/ddi0595/2021-12/AArc...


there is no branchness bit, if there's an endbranch you can jump to it


Ah so when you return from an interrupt, the check is no longer done?


I'd assume so, since it wouldn't be a call/jmp coming from a computed address in a register. That said, I haven't read the documentation for any of this. But interrupts involve a stack pointer change and other things happening that would be different, which is why they use the IRET instruction and not the RET one.


Various architectures do other interesting things with NOPs, IIRC one convention on PowerPC had something vaguely related to debugging or tracing (I can't remember the details or find any references right now).


Not just architectures: different OSes and ABIs have found ways to repurpose no-ops, too. One example[1] is Windows using the 2-byte "MOV EDI, EDI" as a hot-patch point: it gets replaced by a "JMP $-5" instruction, which jumps to 5 bytes before the start of the function, into a spot reserved for patching. Those 5 bytes are enough to contain a full jump instruction that can then jump wherever you need it to.

"Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?"

[1]: https://devblogs.microsoft.com/oldnewthing/20110921-00/?p=95...
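The hot-patch layout described above looks roughly like this (a sketch in Intel syntax to match the article; label names are illustrative):

    ; Unpatched: 5 bytes of padding precede the function,
    ; and the function begins with the 2-byte no-op.
        nop                 ; 5 bytes reserved before
        nop                 ; the function by the linker
        nop
        nop
        nop
    foo:
        mov edi, edi        ; 2-byte no-op: the hot-patch point
        push ebp
        ...

    ; Patched at run time:
        jmp hook_function   ; 5-byte near jump written into the padding
    foo:
        jmp short $-5       ; 2 bytes: jump back into the padding
        push ebp
        ...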


Interesting, thanks for pointing this out! Just yesterday I was gazing at some program containing two consecutive xor rax, rax. I thought what’s the point? But as you point out it might be a NOP sled designed to be that specific length.


That would be surprising. xor is often used like that to set a register to 0, which is far from a nop. I'm not sure why it would do it twice, but it might be as simple as the compiler being stupid.


The second one is effectively a nop though.

The fact that it’s xor rax, rax rather than xor eax, eax is also interesting, as it’s one byte longer for exactly the same effect (writing the bottom 32 bits of a register clears the upper 32 bits). It makes me think there’s something weird going on other than compiler stupidity. I’d be interested in seeing the code it was compiled from.
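For reference, the encodings, which explain the size difference:

    31 c0       xor eax, eax    ; 2 bytes: zeroes rax (the 32-bit write clears the upper half)
    48 31 c0    xor rax, rax    ; 3 bytes: same effect, via a redundant REX.W prefix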


I wonder if this is still true. Whenever I go to hook Win32 API functions, I use an off-the-shelf length disassembler to create a trampoline with the first n bytes of instructions and a jmp back, and then just patch in a jmp to my hook, but if this hot-patch point exists it'd be a lot less painful since you can avoid basically all of that.

Though, I guess even if it was, it'd be silly to rely on it even on x86 only. Maybe it would still make for a nice fast-path? Dunno.


Good read. Thank you.

This just worsens my fear of changing "unnecessary" code when I don't know the original motivation for it.


Intel VTune will do this with 5-byte NOPs directly. I think LLVM's XRay tracing suite also did this with a much bigger NOP, to capture more information.


RISC-V has a whole HINT space that's basically just morphs of load immediate into zero register.

AArch64 has a similar space: https://developer.arm.com/documentation/ddi0596/2020-12/Base...

And yes, PowerPC has a similar space as well holding hints like 'give priority to the other hardware threads on this core' and the like. https://utcc.utoronto.ca/~cks/space/blog/tech/PowerPCInstruc...


I was wondering where I had read about PowerPC, and this is exactly the article! So it was for thread priority. That strikes me as an odd design choice; this probably should have been managed more explicitly by the OS.


I think the idea of exposing it to user space is to better handle concurrency before trapping into the kernel.

So consider the case of a standard mutex in the contended case. Normally the code will spin for a little bit before informing the kernel scheduler on the off chance that the thread that owns the lock is currently scheduled on another hardware thread. In that case it's in the best interest of the thread trying to grab the lock to shift most of the intracore priority to any other hardware threads so that it can potentially help the other hardware thread holding the lock get to a point where it gives up the lock quicker.
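A sketch of that pattern in PowerPC assembly (register use is illustrative; `or 1,1,1` and `or 2,2,2` are the low/medium priority hint no-ops, known as HMT_low/HMT_medium in the Linux kernel):

    spin:
        or    1,1,1         # HMT_low: shift priority to the sibling hardware threads
        lwz   9, 0(3)       # re-read the lock word (r3 = &lock)
        cmpwi 9, 0
        bne   spin          # still held: keep spinning at low priority
        or    2,2,2         # HMT_medium: restore normal priority
        # ... attempt the real lwarx/stwcx. acquire, and only fall
        # back to telling the kernel scheduler if that fails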


https://www.ibm.com/docs/en/aix/7.3?topic=h-hpmstat-command:

“random_samp_ele_crit=name: Specifies the random criteria for selecting the instructions for sampling. Valid values for this option are as follows:

ALL_INSTR: All instructions are eligible. This value is the default setting.

LOAD_STORE: The operation is routed to the Load Store Unit (LSU); for example, load, store.

PROB_NOP: Sample only special no-operation instructions, which are called Probe NOP events.

[…]”


Some MIPS cores had a superscalar NOP that would stall every ALU by one cycle, which was necessary because they lacked synchronization instructions.


That’s really clever use of the opcode space. Thanks for passing that along.


NOP on Intel is in fact xchg eax, eax (the single byte 0x90)


It was an old joke that the opposite of "goto" is "come from", or that if goto is considered harmful, nobody said anything about a "come from". Marking something as a branch target reminds me of this.

https://en.m.wikipedia.org/wiki/COMEFROM


> GOTO considered harmful

COMEFROM considered harm-mitigating

It ingeniously makes jump-oriented programming (JOP) a lot harder.


> COMEFROM considered harm-mitigating

You know, that’d be a fantastic OpenBSD release name.

Here’s hoping a dev sees this comment; there’s already been a few commenting in this thread.


Interesting. Seems like enforcement on Intel CPUs is supported since Tiger Lake (so ~2020). Windows has basically the same feature implemented in software since 2015, called Control Flow Guard [1]. I wonder what the story there is, and if Windows has any plans to (get everyone to) switch to the hardware version once those CPUs have sufficient market share.

1: https://learn.microsoft.com/en-us/windows/win32/secbp/contro...


Windows also recently implemented a far better version of this called Extended Flow Guard (XFG) that not only checks whether the location is a valid destination, but also whether it's a valid destination for that specific source.

For example, for any virtual function call or function pointer call, the destination must carry a tag matching a hash of the function's type signature. It's much more secure, and also faster, since loading the tag from memory can be merged with loading the actual code after it.

I wish this were the one implemented in hardware.
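Going by the public write-ups (so take the details as a sketch rather than official documentation), an XFG-instrumented indirect call site looks something like:

        mov  r10, 0x1234567890abcdef  ; immediate: hash of the target's expected type signature
        call [__guard_xfg_dispatch]   ; the dispatch stub compares r10 against the tag
                                      ; stored in the 8 bytes just before the target's
                                      ; entry point, then jumps to the target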


There’s a great article on XFG here [0] but it observed that a failed XFG check downgrades to a regular CFG check instead of a denial.. meaning it adds zero extra protection? Perhaps this behavior has changed since the preview they tested, though!

[0]: https://www.offsec.com/offsec/extended-flow-guard/


That can't be right, it would be entirely pointless then. It looks like the article was written during a pre-release time, so maybe it wasn't fully enabled?

I've not yet been able to use XFG in any production software, due to the requirement of rebuilding every statically linked library with it enabled. But it didn't seem to fall back to CFG when I was testing it in a toy program.


That does sound like it would be more robust, but definitely sounds like it'd require a lot more silicon than the IBT that they did implement. Something like it might be something that comes in some future revisions.


ARM does it!


Interesting. I was able to get Clang to generate this using `-fcf-protection=branch`: https://godbolt.org/z/rooP8vPsM

It looks like endbr64 is a 4-byte instruction. That could be a significant code size overhead for jump tables with lots of targets: https://godbolt.org/z/xTPToaddh


OpenBSD disables jump tables in Clang on amd64 due to IBT, some architectures also had jump tables disabled as part of the switch to --execute-only ("xonly") binaries by default, e.g: powerpc64/sparc64/hppa.

https://marc.info/?l=openbsd-cvs&m=168254711511764&w=2

E.g: https://marc.info/?l=openbsd-cvs&m=167337396024167&w=2


Any idea what the performance impact is?


Why should every function start with an endbr64 instruction? Aren't functions usually called directly?

Also, is it required to insert an endbr64 instruction after function calls (for the return address)?


As to why they're not always called directly, imagine some code like this:

    int FooWithoutChecks(void *p);
    
    int Foo(void *p) {
      if (p == NULL) return -1;
      return FooWithoutChecks(p);
    }
In general the caller is expected to call Foo if they aren't sure whether the pointer might be null; if they already know the pointer is not null (e.g. because they already checked it themselves), they can call FooWithoutChecks and avoid a null check that they know can never trigger.

The naive way to emit assembly for this is to actually emit two separate functions, and have Foo call FooWithoutChecks the usual way. But notice that the call to FooWithoutChecks is a tail call, so the compiler can apply tail call optimization: it inlines FooWithoutChecks into Foo and emits a single body. This is nice because a call to Foo now avoids a call/ret pair, saving two instructions on every call to Foo. But what if someone calls FooWithoutChecks? Simple: you just call the offset into Foo just past the pointer comparison. This works because Foo already ends with a ret instruction, so the call to FooWithoutChecks just reuses the existing ret. This optimization also saves some space in the binary, which has various benefits in and of itself.

The example here with the null pointer check is kind of contrived, but this kind of pattern happens a LOT in real code when you have a small wrapper function that does a tail call to another function, and isn't specific to pointer checks.
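Concretely, the compiler can emit a single body with two entry points, something like this (AT&T syntax to match the examples upthread; the layout is a sketch):

    Foo:
        endbr64
        testq   %rdi, %rdi        # p == NULL?
        jne     FooWithoutChecks  # non-null: skip the error path
        movl    $-1, %eax
        ret
    FooWithoutChecks:
        endbr64                   # needed if this entry can be called indirectly
        # ... the actual work ...
        ret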


> Why should every function start with endbr64 command? Aren't functions usually called directly?

They're usually called directly, but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required.

> Also, is it required to insert endbr64 command after function calls (for return address)?

No, IBT only covers indirect jmp and call. The shadow stack (the other half of CET) is the equivalent mechanism for ret.


> but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required

Then why not just have the compiler break every non-static function into two parts: a static function that contains all the logic, and a non-static function that just contains an endbr64 and a direct jump to the static one? (Or, better yet, place the non-static label just before the static one, and have the non-static entry fall through into the body of the static.) Then the static direct call sites won't have to pay the overhead of executing the IBT NOP.
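The fall-through variant being proposed would look something like this (labels illustrative):

    foo:                # exported symbol: indirect calls land here
        endbr64
    foo_direct:         # local alias: direct calls target this, skipping the endbr64
        # ... function body ...
        ret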


The IBT NOP is "free" in that it will evaporate in the pipeline; it still has to be fetched and decoded to some extent, but it does not consume execution resources.

From a tooling perspective, what you're describing (two entry points for a function; the jump you mention is unnecessary) would require changes up and down the toolchain: it would affect the compiler, all linkers, all debuggers, etc. By contrast, just adding an additional instruction to the function prologue is relatively low-impact.

It's also worth noting that at the time code for a function is emitted, the compiler is not aware of whether the symbol will be exported and thus discoverable in some other module, or by symbol table lookup, so emitting the target instruction is essentially mandatory.


Doesn't seem like it'd be that difficult to make the change the other direction, i.e. keep endbr64 as-is as the default case, but if there's a direct jump/call to anywhere that starts with endbr64, offset the immediate by 4 bytes; could be done in any single stage of toolchain that has that info with no extra help. But yeah, quite low impact, might not even affect decode throughput & cache usage for at least one of the direct or indirect cases.


> Doesn't seem like it'd be that difficult

Show me the code -- better yet, submit it to the relevant projects! :)


That's absolutely doable, just... how much slower or faster is a predicted unconditional jump than ENDBR64? What's the ratio of virtual to static calls in real-world programs? And while your last proposal ("foo: endbr64; foo_internal: <code>") sidesteps those questions, it raises questions about maintaining function alignment (16 bytes, IIRC; is that even necessary today?) and about restructuring the compiler to distinguish the inner/external symbol addresses. Plus, of course, somebody has to actually sit down and write the code to implement that, as opposed to just adding "if (func->is_escaping) emit_endbr(...);" at the beginning of the code that emits the object code for a function body.


That sounds a lot like “add a prefix to the function with an endbr64 instruction”.


What is the overhead of executing the IBT NOP?


It's not "executed" per se. It consumes space in the cache hierarchy, and a slot in the front-end decoder. It won't ever be issued, but depending on the microarchitecture in question it might result in an issue cycle having less occupancy than it might have had in the case where the subsequent instruction was available.

With that said, the first few instructions of a called function often stall due to stack pointer dependencies, etc. so the true execution cost is likely to be even smaller than the above might suggest.


C allows for any function to be called via a function pointer, and functions can be in different translation units, so the compiler can't simply assume that a function will never be called indirectly and has to pessimistically insert endbr64 in order to maintain a reasonable ABI.

And no, as I understand it, this is only for branch/calls not returns.


Well, if the function is marked "static", the compiler can actually check whether the function's address is taken in the current compilation unit and omit/emit ENDBR64 accordingly (passing pointers to static functions to code in other compilation units is legal, and should still work).


Good catch. Yeah, as long as the function's address is never taken, the compiler has a lot of leeway with static functions; it can even avoid emitting code for them entirely if it can prove they're never called, or if it's able to compute their results at compile time.


Yep. Or inline them at every call site if that makes sense to do based on the optimization level and flags.


Is this theoretically something LTO could remove?


If you disable dlopen and LD_PRELOAD.


dlopen() "sees" only functions marked as exported (with a macro like DLLEXPORT on Windows), not every function, or am I wrong? Is C that bad?


On openbsd at least, every global symbol is exported unless you use an explicit symbol list. It's unusual for executables.


A traditional compiler needs to insert them for all external functions, because other compilation units may make an indirect call.


In case anyone wants a very simple introduction to JOP/COP exploits and mitigations of this type: <https://www.theregister.com/2020/06/15/intel_cet_tiger_lake/>


Thank you for the explanation!



