For anybody unfamiliar with this, as I was, this appears to refer to Intel's Indirect Branch Tracking feature[1] (and the equivalent on ARM, BTI). The idea is that an indirect branch can only pass control to a location that starts with an "end branch" instruction. An indirect branch is one that jumps to a location whose value is loaded or computed from either a register or memory address: think calling a function pointer in C.
Without IBT, you'd have this equivalence between C and assembly:
void foo(void);
int main(void) {
    void (*f)(void);
    f = foo;
    f();
}
void foo(void) { }
---
main:
movl $foo, %edx
call *%edx
ret
foo:
ret
If IBT is enabled, the above code triggers an exception because foo doesn't begin with an "end branch" instruction. When IBT is enabled by the compiler, the above code gets assembled as:
main:
endbr64
movl $foo, %edx
call *%edx
ret
foo:
endbr64
ret
Now the compiler inserts endbr64 at the start of each function prologue. The reason for this feature is to serve as defense in depth against JOP and COP attacks: it means that the only "gadgets" available to an attacker are entire functions, which can be far harder to exploit and chain.
There's a good question in the comments there that I still don't see the answer to. How does this work if there's an interrupt between the branch and the endbranch? Does the OS need to save/restore the "branchness" bit?
I'd assume so, since an interrupt return isn't a call/jmp through a computed address in a register. That said, I haven't read the documentation for any of this. But interrupts involve a stack pointer change and other state changes that make them different, which is why they return with the IRET instruction and not RET.
Various architectures do other interesting things with NOPs, IIRC one convention on PowerPC had something vaguely related to debugging or tracing (I can't remember the details or find any references right now).
Not just architectures: different OSes and ABIs have found ways to repurpose no-ops too. One example[1] is Windows using the 2-byte "MOV EDI, EDI" as a hot-patch point: it gets replaced by a "JMP $-5" instruction, which jumps to a point 5 bytes before the start of the function, into a region reserved for patching. Those 5 bytes are enough to contain a full jump instruction that can then jump wherever you need it to.
## Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?
Interesting, thanks for pointing this out! Just yesterday I was gazing at some program containing two consecutive xor rax, rax. I thought what’s the point? But as you point out it might be a NOP sled designed to be that specific length.
That would be surprising. xor is often used like that to set a register to 0, which is far from a nop. I'm not sure why it would do it twice, but it might be as simple as the compiler being stupid.
The fact that it’s xor rax, rax rather than xor eax, eax is also interesting as it’s one byte longer for exactly the same effect (modifying the bottom 32 bits of a register clears the upper 32 bits). It makes me think there’s something weird going on other than compiler stupidity. I’d be interested in seeing the code it was compiled from.
I wonder if this is still true. Whenever I go to hook Win32 API functions, I use an off-the-shelf length disassembler to create a trampoline with the first n bytes of instructions and a jmp back, and then just patch in a jmp to my hook, but if this hot-patch point exists it'd be a lot less painful since you can avoid basically all of that.
Though, I guess even if it was, it'd be silly to rely on it even on x86 only. Maybe it would still make for a nice fast-path? Dunno.
Intel Vtune will do this with 5-byte NOPs directly. I think LLVM's x-ray tracing suite did this with a much bigger NOP, also, to capture more information.
I was wondering where I'd read about PowerPC, and this is exactly the article! So, it was for thread priority. It strikes me as an odd design choice; this probably should have been something managed more explicitly by the OS.
I think the idea of exposing it to user space is to better handle concurrency before trapping into the kernel.
So consider the case of a standard mutex in the contended case. Normally the code will spin for a little bit before informing the kernel scheduler on the off chance that the thread that owns the lock is currently scheduled on another hardware thread. In that case it's in the best interest of the thread trying to grab the lock to shift most of the intracore priority to any other hardware threads so that it can potentially help the other hardware thread holding the lock get to a point where it gives up the lock quicker.
It was an old joke that the opposite of "goto" is "come from", or that if goto is considered harmful, nobody said anything about a "come from". Marking something as a branch target reminds me of this.
Interesting. Seems like enforcement on Intel CPUs is supported since Tiger Lake (so ~2020). Windows has basically the same feature implemented in software since 2015, called Control Flow Guard [1]. I wonder what the story there is, and if Windows has any plans to (get everyone to) switch to the hardware version once those CPUs have sufficient market share.
Windows also recently implemented a far better version of this called Extended Flow Guard (XFG) that not only checks whether the location is a valid destination, but also whether it's a valid destination for that specific source.
For example, for any virtual function call or function pointer call, the destination must have a correct tag with the hash of the arguments. It's much more secure, and also faster, since loading the tag from memory can be merged with loading the actual code after it.
There’s a great article on XFG here [0], but it observed that a failed XFG check downgrades to a regular CFG check instead of a denial... meaning it adds zero extra protection? Perhaps this behavior has changed since the preview they tested, though!
That can't be right, it would be entirely pointless then. It looks like the article was written during a pre-release time, so maybe it wasn't fully enabled?
I've not yet been able to use XFG in any production software, due to the requirement of rebuilding every statically linked library with it enabled. But it didn't seem to fall back to CFG when I was testing it in a toy program.
That does sound more robust, but it would definitely require a lot more silicon than the IBT they did implement. Something like it might come in a future revision.
It looks like endbr64 is a 4-byte instruction. That could be a significant code size overhead for jump tables with lots of targets: https://godbolt.org/z/xTPToaddh
OpenBSD disables jump tables in Clang on amd64 due to IBT, some architectures also had jump tables disabled as part of the switch to --execute-only ("xonly") binaries by default, e.g: powerpc64/sparc64/hppa.
As to why they're not always called directly, imagine some code like this:
int FooWithoutChecks(void *p);
int Foo(void *p) {
if (p == NULL) return -1;
return FooWithoutChecks(p);
}
In general the caller is expected to call Foo if they aren't sure whether the pointer might be null; if they already know the pointer is not null (e.g. because they checked it themselves), they can call FooWithoutChecks and avoid a null check that they know will never be true.
The naive way to emit assembly for this is to actually emit two separate functions and have Foo call FooWithoutChecks the usual way. But notice that the call to FooWithoutChecks is a tail call, so the compiler can apply tail call optimization: it emits Foo with the body of FooWithoutChecks laid out immediately after the pointer check, so the inner logic is effectively inlined into Foo. This is nice because calls to Foo now avoid a call/ret pair, saving two instructions on every call. But what if someone calls FooWithoutChecks? Simple: that symbol just points at the offset into Foo's body past the pointer comparison. This works because Foo already ends in a ret instruction, so a call to FooWithoutChecks reuses the existing ret. The optimization also saves some space in the binary, which has various benefits in and of itself.
The example here with the null pointer check is kind of contrived, but this kind of pattern happens a LOT in real code when you have a small wrapper function that does a tail call to another function, and isn't specific to pointer checks.
> Why should every function start with endbr64 command? Aren't functions usually called directly?
They're usually called directly, but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required.
> Also, is it required to insert endbr64 command after function calls (for return address)?
No, IBT is only for indirect jmp and call. The shadow stack (CET-SS) is the equivalent mechanism for ret.
> but unless the compiler can prove that they always are (e.g., if they're static and nothing in the same file takes the address), endbr64 is required
Then why not just have the compiler break down every non-static function into two blocks: a static function that contains all the logic, and a non-static function that just contains an IBT and a direct jump to the static function? (Or, better yet, place the non-static label just before the static one, and have the non-static fall through into the body of the static.) Then the static direct callsites won't have to pay the overhead of executing the IBT NOP.
The IBT NOP is "free" in that it will evaporate in the pipeline; it still has to be fetched and decoded to some extent, but it does not consume execution resources.
From a tooling perspective, what you're describing (two entrypoints for a function, the jump you mention is pointless) would require changes up and down the toolchain; it would affect the compiler, all linkers, all debuggers, etc. By contrast, just adding an additional instruction to the function prolog is relatively low-impact.
It's also worth noting that at the time code for a function is emitted, the compiler is not aware of whether the symbol will be exported and thus discoverable in some other module, or by symbol table lookup, so emitting the target instruction is essentially mandatory.
Doesn't seem like it'd be that difficult to make the change the other direction, i.e. keep endbr64 as-is as the default case, but if there's a direct jump/call to anywhere that starts with endbr64, offset the immediate by 4 bytes; could be done in any single stage of toolchain that has that info with no extra help. But yeah, quite low impact, might not even affect decode throughput & cache usage for at least one of the direct or indirect cases.
That's absolutely doable, just... How much slower or faster is a predicted unconditional jump than ENDBR64? What's the ratio of virtual/static calls in real-world programs? And while your last proposal ("foo: endbr64; foo_internal: <code>") evades those questions, it raises questions about maintaining function alignment (16 bytes IIRC? Is this even necessary today?) and restructuring the compiler to distinguish the inner/external symbol addresses. Plus, of course, somebody has to actually sit down and write the code to implement that, as opposed to just adding "if (func->is_escaping) emit_endbr(...);" at the beginning of the code that emits the object code for a function body.
It's not "executed" per se. It consumes space in the cache hierarchy, and a slot in the front-end decoder. It won't ever be issued, but depending on the microarchitecture in question it might result in an issue cycle having less occupancy than it might have had in the case where the subsequent instruction was available.
With that said, the first few instructions of a called function often stall due to stack pointer dependencies, etc. so the true execution cost is likely to be even smaller than the above might suggest.
C allows for any function to be called via a function pointer, and functions can be in different translation units, so the compiler can't simply assume that a function will never be called indirectly and has to pessimistically insert endbr64 in order to maintain a reasonable ABI.
And no, as I understand it, this is only for branch/calls not returns.
Well, if the function is marked "static", the compiler can actually check whether the function's address is taken in the current compilation unit and omit/emit ENDBR64 accordingly (passing pointers to static functions to code in other compilation units is legal, and should still work).
Good catch. Yeah, as long as the function's address is never taken, the compiler has a lot of leeway with static functions; it can even avoid emitting code for them entirely if it can prove they're never called, or if it's able to compute their results at compile time.
[1]: https://www.intel.com/content/dam/develop/external/us/en/doc...