ARM Assembly: ∞ Ways to Return (2017)

derf_ · on Feb 10, 2023

In 2010, I tried to use callgrind to profile a project on Arm, after having used it to great effect on x86, and discovered that because of the variety of ways to return (and call!) functions on Arm, callgrind was unable to reliably identify function call and return sites. It created cycles in the call graph and even failed to record a function's self measurements correctly (because it could not tell when you left that function).

The problem boiled down to the valgrind frontend code that splits things up into basic blocks being incapable of having an instruction be both a conditional jump and a function call / return at the same time. That never happens on x86, but of course this is possible (and totally normal) on 32-bit Arm. Sadly, I ran out of time to try to re-architect this code and had to move on to other projects.

Over 12 years later, it looks like it never did get fixed: https://bugs.kde.org/show_bug.cgi?id=252091

Dwedit · on Feb 10, 2023

I use stuff like "bxeq lr" all the time. There's your conditional return instruction.

klodolph · on Feb 10, 2023

Thankfully, PC is no longer a GPR in ARM64. Making PC a GPR seems elegant at first glance, but when you actually dive into it and see how it affects processor implementations and how it affects the code you write, it turns out to be extremely messy and inconvenient. Good riddance PC as GPR, don’t let the door hit you on the way out.

crest · on Feb 10, 2023

It's neat when writing assembler, e.g. to add a scaled byte value to the PC to implement a jump table, or to perform a scaled and indexed load to the PC. In ARM it also produced a neat, short and fast function prologue/epilogue. In my opinion the worst problem it causes is the 1001 special cases it adds in an optimised out-of-order implementation. The Thumb interworking makes it worse, but is useful to increase code density in ARMv6-M and can even increase performance (per clock) of ARMv7-M cores. I don't expect it causes too many problems in single-issue in-order implementations like the Cortex-M3 and M4. I would like to know how much design time and core area is spent on this in the M7 and M85 cores.
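
For anyone who hasn't seen the jump table trick, here is a rough Python model of the address arithmetic behind something like `add pc, pc, r0, lsl #2` in ARM state. The function name and the addresses are made up for illustration; the only architectural fact it leans on is that reading PC in ARM state yields the instruction's own address plus 8.

```python
def add_pc_jump_target(instr_addr: int, index: int, shift: int = 2) -> int:
    """Model `add pc, pc, rN, lsl #shift` executed in ARM state.

    instr_addr is the (hypothetical) address of the ADD itself,
    index is the value held in rN.
    """
    pc_as_operand = instr_addr + 8  # ARM-state PC reads two instructions ahead
    return (pc_as_operand + (index << shift)) & 0xFFFFFFFF


if __name__ == "__main__":
    base = 0x8000  # hypothetical address of the ADD instruction
    for i in range(3):
        print(i, hex(add_pc_jump_target(base, i)))
```

Index 0 lands at `base + 8`, so the word at `base + 4` is never reached through the table; each following word is one table slot, usually holding a branch.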

sweetjuly · on Feb 11, 2023

Even for regular in-order cores, it makes branch prediction a massive pain because now your fast frontend predictors need to essentially fully decode the instruction in order to determine if it can be considered a branch. Most other ISAs make this simple because there are only a few opcodes that change control flow and so you can very easily just stuff that in your early frontend decoder.

RISC-V unfortunately didn't quite do this well, since the same opcode (JALR) is used for call, return, and indirect branch, and so you have to fully decode the instruction in order to determine whether you should use the RAS or your other predictors. This isn't a problem that can't be overcome (next-line predictors help a lot for these early predictions), but it makes something very performance critical just that much harder.
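
To make the "fully decode" point concrete, here is a small Python sketch of the register-based return-address-stack hints as I understand them from the unprivileged ISA spec: x1 and x5 count as link registers, and the push/pop decision depends on both `rd` and `rs1`. The function names are mine, and compressed encodings (`c.jr`, `c.jalr`) are left out.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0) are treated as link registers


def encode_jalr(rd: int, rs1: int, imm: int = 0) -> int:
    """Build a 32-bit JALR word (opcode 0x67, funct3 0)."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x67


def ras_action(instr: int) -> str:
    """Classify a 32-bit instruction word for the return-address stack."""
    opcode = instr & 0x7F
    rd = (instr >> 7) & 0x1F
    funct3 = (instr >> 12) & 0x7
    rs1 = (instr >> 15) & 0x1F
    if opcode != 0x67 or funct3 != 0:
        return "not-jalr"
    rd_link = rd in LINK_REGS
    rs1_link = rs1 in LINK_REGS
    if not rd_link and not rs1_link:
        return "none"           # plain indirect jump
    if not rd_link:
        return "pop"            # looks like a return
    if not rs1_link:
        return "push"           # looks like a call
    return "push" if rd == rs1 else "pop-then-push"


if __name__ == "__main__":
    print(ras_action(0x00008067))          # ret, i.e. jalr x0, 0(x1)
    print(ras_action(encode_jalr(1, 10)))  # jalr ra, 0(a0)
```

The opcode alone says nothing here: the predictor needs both register fields before it knows which structure to consult.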

ksherlock · on Feb 10, 2023

Early versions of ARM (ARM 1/2, optional in 3/4) had a combined program counter / status register; since there was only a 26-bit address space and instructions are always 32-bit word aligned, the top 6 and bottom 2 bits were used for the status register.

So, if you're still developing for an ARM1, not all of these are equivalent. MOV/POP/etc will set the PC and the status register; B/BL will leave the status register bits alone.

* edit: MOV vs. MOVS determines whether the status bits are written to R15.
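
A minimal Python sketch of the combined R15 described above, assuming the usual 26-bit layout (N Z C V I F in bits 31..26, the word-aligned PC in bits 25..2, the mode in bits 1..0). The helper names are hypothetical, and `write_r15` is a simplification that ignores the restrictions on what user mode may change.

```python
PC_MASK = 0x03FFFFFC    # bits 25..2: word-aligned program counter
MODE_MASK = 0x00000003  # bits 1..0: processor mode
FLAG_SHIFT = 26         # bits 31..26: N Z C V I F


def unpack_r15(r15: int) -> dict:
    """Split a 26-bit-era R15 into PC, mode and status fields."""
    top = (r15 >> FLAG_SHIFT) & 0x3F
    return {
        "pc": r15 & PC_MASK,
        "mode": r15 & MODE_MASK,
        "n": (top >> 5) & 1,
        "z": (top >> 4) & 1,
        "c": (top >> 3) & 1,
        "v": (top >> 2) & 1,
        "irq_disable": (top >> 1) & 1,
        "fiq_disable": top & 1,
    }


def write_r15(old: int, value: int, restore_psr: bool) -> int:
    """Model MOVS-style (restore_psr=True) vs MOV-style writes to R15."""
    if restore_psr:
        return value & 0xFFFFFFFF
    # Only the PC field changes; flags and mode bits are kept.
    return (old & ~PC_MASK & 0xFFFFFFFF) | (value & PC_MASK)


if __name__ == "__main__":
    print(unpack_r15(0xA0008003))
    print(hex(write_r15(0xA0008003, 0x4000C001, restore_psr=False)))
```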

Dwedit · on Feb 10, 2023

Method 1 (popping PC off the stack) and Method 3 (mov pc,lr) do not work on the earliest ARM processors that support THUMB, as they will not switch to THUMB mode without executing a BX instruction.

Checking reference manuals:

ARMV4T (ARM7TDMI/ARM9TDMI): Does NOT switch to THUMB mode automatically

ARMV5: Does NOT switch to THUMB mode automatically

ARMV7: Does switch to THUMB mode automatically

benj111 · on Feb 10, 2023

I thought it switched to Thumb based on whether the address is odd or even.

Plus if you don't want to switch to thumb, this still works?

pm215 · on Feb 10, 2023

It is based on whether the low bit of the jump target is 0 or 1, but the first version of Thumb only did that check-and-switch-mode on a small set of jump instructions, not on every way you could alter the program counter. For the others you got the same behaviour you always had for an attempt to jump to an unaligned address, which is to say the low bit was just ignored. The compiler had to generate slightly different code if you wanted your function to support interworking. In the versions of Thumb starting with IIRC Armv5t or maybe v6t2, more instructions did the mode switch check, and codegen got a bit simpler.
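
Here is a toy Python model of that difference, covering only branches taken from ARM state and only the two ends of the range this thread agrees on (ARMv4T and ARMv7). The instruction labels and the table are illustrative shorthand, not something taken from the reference manual.

```python
from typing import Tuple

# Which PC-writing instructions check bit 0 of the target (assumed per the thread).
INTERWORKING = {
    "armv4t": {"bx"},
    "armv7": {"bx", "pop_pc", "mov_pc"},
}


def branch_from_arm(target: int, instr: str, arch: str) -> Tuple[int, str]:
    """Return (address actually fetched from, resulting instruction set state)."""
    if instr in INTERWORKING[arch] and target & 1:
        return target & ~1, "thumb"  # low bit selects Thumb and is then cleared
    # No mode switch: the low bits are simply dropped and we stay in ARM state.
    return target & ~3, "arm"


if __name__ == "__main__":
    thumb_func = 0x8001  # hypothetical Thumb function address with bit 0 set
    for arch in ("armv4t", "armv7"):
        for instr in ("bx", "mov_pc", "pop_pc"):
            print(arch, instr, branch_from_arm(thumb_func, instr, arch))
```

In this model `mov_pc` on `armv4t` ends up at the right address but in the wrong state, which is the case where Thumb code gets decoded as ARM.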

vore · on Feb 11, 2023

On older ARM only bx is allowed to switch Thumb state, even if the address you're giving it is e.g. a Thumb address in ARM mode. You can still use pc as a GPR to jump ARM-ARM or Thumb-Thumb, though.

Dwedit · on Feb 10, 2023

It does work if you intend to stay in ARM mode only, and will crash if THUMB-mode code calls the function. ARMV7 will do the mode switch automatically and not crash.

Olipro · on Feb 10, 2023

For older architectures, you really want to use the BX instruction unless you can guarantee you're not switching execution mode.

As a bit of pointless trivia, MOV PC, PC does not cause an infinite loop - it skips the instruction immediately following.

Dwedit · on Feb 11, 2023

Correct, PC as a source register means PC + 8.
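
A tiny Python sketch of why that skips exactly one instruction, assuming ARM state with 4-byte instructions. The "program" is just a made-up dict of address to label; nothing here decodes real encodings.

```python
def run(program: dict, start: int, max_steps: int = 16) -> list:
    """Walk a toy program and return the labels of the instructions executed."""
    executed = []
    pc = start
    while pc in program and len(executed) < max_steps:
        op = program[pc]
        executed.append(op)
        if op == "mov pc, pc":
            pc = pc + 8  # PC read as a source operand is this address + 8
        else:
            pc = pc + 4  # ordinary fall-through to the next instruction
    return executed


if __name__ == "__main__":
    program = {0x0: "mov pc, pc", 0x4: "skipped", 0x8: "landed"}
    print(run(program, 0x0))
```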

schoen · on Feb 11, 2023

It's sad (but very reasonable) that newer architectures have tried to be significantly less flexible about control flow for security reasons, in the direction of "there is only one way to call a function, and only one way to return from a function, and you have to tell the system where your functions and returns are so that someone can't call or return into the middle or can't leave the function at an unexpected point" (and, of course, no self-modifying code, and even discouraging JITs).

not2b · on Feb 10, 2023

As other commenters have mentioned, exploiting this will confuse other tools and debuggers. It also tends to play havoc with branch prediction, meaning that there may be performance penalties.

garbagecoder · on Feb 10, 2023

fwiw, if you're using ARM assembly on an Apple device there are a few differences and one of them is how you pass arguments.

https://developer.apple.com/documentation/xcode/writing-arm6...

wk_end · on Feb 10, 2023

The article is describing classic ARM. (Modern) Apple devices are all ARM64, which doesn't have the PC as a GPR.

The article is also entirely about how the PC is a general-purpose register on 32-bit ARM machines. No idea if the 1st gen iPhones or whatever used an idiosyncratic calling convention...but it's moot in the context of this post, because argument passing isn't covered here!

This post really is just about the observation that the PC being a GPR implies that there's a bunch of different ways to get data into it. It's pretty airy. The author was admittedly a first or second year university student at the time, so it's hard to be too mad though.

flykespice · on Feb 10, 2023

The only one getting mad here is you for no reason.

ngcc_hk · on Feb 11, 2023

Quite interesting. I wonder if there's any analysis of at least some of these choices.

hun3 · on Feb 11, 2023

Does anyone have the link to the Usenet discussion saying that PC (R15) as a GPR was too "uniform" (I don't know exactly what it called it)?