ARM immediate value encoding (mcdiarmid.org)



Thumb-2 immediate encoding is even more gleeful: in addition to allowing rotation, it also allows spaced repetition of any 8-bit pattern (common in low-level bit hacks, like those in [1]) to be encoded in a single instruction.

For those interested, check out page 122 of the ARMv7-M architecture reference manual[2]:

  // ThumbExpandImm_C()
  // ==================
  (bits(32), bit) ThumbExpandImm_C(bits(12) imm12, bit carry_in)
    if imm12<11:10> == '00' then
      case imm12<9:8> of
        when '00'
          imm32 = ZeroExtend(imm12<7:0>, 32);
        when '01'
          if imm12<7:0> == '00000000' then UNPREDICTABLE;
          imm32 = '00000000' : imm12<7:0> : '00000000' : imm12<7:0>;
        when '10'
          if imm12<7:0> == '00000000' then UNPREDICTABLE;
          imm32 = imm12<7:0> : '00000000' : imm12<7:0> : '00000000';
        when '11'
          if imm12<7:0> == '00000000' then UNPREDICTABLE;
          imm32 = imm12<7:0> : imm12<7:0> : imm12<7:0> : imm12<7:0>;
      carry_out = carry_in;
    else
      unrotated_value = ZeroExtend('1':imm12<6:0>, 32);
      (imm32, carry_out) = ROR_C(unrotated_value, UInt(imm12<11:7>));
    return (imm32, carry_out);
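
For readers who prefer C to the manual's pseudocode, here is a rough sketch of the same expansion (carry handling omitted; thumb_expand_imm is just an illustrative name):

    #include <stdint.h>

    /* Rough C rendering of ThumbExpandImm above (carry logic omitted). */
    static uint32_t thumb_expand_imm(uint32_t imm12)
    {
        uint32_t imm8 = imm12 & 0xffu;

        if (((imm12 >> 10) & 0x3u) == 0) {
            switch ((imm12 >> 8) & 0x3u) {
            case 0:  return imm8;                                /* 0x000000ab */
            case 1:  return (imm8 << 16) | imm8;                 /* 0x00ab00ab */
            case 2:  return (imm8 << 24) | (imm8 << 8);          /* 0xab00ab00 */
            default: return (imm8 << 24) | (imm8 << 16)
                          | (imm8 << 8)  | imm8;                 /* 0xabababab */
            }
        } else {
            /* 8-bit value with an implicit leading 1, rotated right by
               imm12<11:7>; that field is at least 8 in this branch, so the
               shifts below are well defined. */
            uint32_t unrotated = 0x80u | (imm12 & 0x7fu);
            uint32_t rot = (imm12 >> 7) & 0x1fu;
            return (unrotated >> rot) | (unrotated << (32 - rot));
        }
    }
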
[1] http://graphics.stanford.edu/~seander/bithacks.html (worth a read on its own if you're into this kind of thing)

[2] http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readi... (no-registration link)


The set of representable ARM immediates is really nice. It's wonderfully useful for writing soft-float and math library routines, where you have very common values with just some high-order bits set:

    0x3f800000 // encoding of 1.0f
The set of immediate encodings, together with "shifts for free on most operations" (which are closely related features, as the OP points out), went a long way toward preserving my sanity when writing assembly.
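
As an aside, 0x3f800000 fits the scheme as the 8-bit value 0xFE rotated right by 10 (a rotation-field value of 5). A minimal brute-force checker, assuming nothing beyond the rotate-right-by-2n rule the article describes (arm_encode_imm is just an illustrative name):

    #include <stdint.h>
    #include <stdio.h>

    /* Returns 1 (and fills *rot/*imm8) if `value` can be expressed as an
       8-bit constant rotated right by 2*rot, i.e. as an A32
       data-processing immediate; returns 0 otherwise. */
    static int arm_encode_imm(uint32_t value, unsigned *rot, unsigned *imm8)
    {
        for (unsigned r = 0; r < 16; r++) {
            uint32_t v = value;
            if (r != 0)   /* undo a rotate-right of 2r by rotating left 2r */
                v = (value << (2 * r)) | (value >> (32 - 2 * r));
            if (v <= 0xffu) { *rot = r; *imm8 = (unsigned)v; return 1; }
        }
        return 0;
    }

    int main(void)
    {
        unsigned rot, imm8;
        if (arm_encode_imm(0x3f800000u, &rot, &imm8))   /* bit pattern of 1.0f */
            printf("imm8=0x%02x rot=%u\n", imm8, rot);  /* prints imm8=0xfe rot=5 */
        return 0;
    }
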

Worth noting: thumb-2 immediates have a different (and even more interesting) encoding scheme. arm64 immediates are pretty interesting too (there the set of representable immediates is different depending on the instruction domain).


Honestly, with so few bits, I was expecting it to be a lookup table. (Yep, I've never written ARM assembly.)

But then, this way you have a nice set of immediates (as you said), and can set any value at all with at most 3 instructions in the rare case you need something different.


> I was expecting it to be a lookup table.

There's one ISA I've worked with before that does have -2, -1, 0, 1, 2, and some other "commonly used constants" like powers of 2 encoded specially in the immediate. I think it was an 8-bit, but I can't remember exactly which one. Anyone know what I'm referring to?


Similar, but not what you're thinking of: the x87 instruction set has instructions to push the values 1.0, lb(10), lb(e), pi, lg(2), ln(2), and 0.0.


The MSP430 has a constant generator register, which depending on the access mode used will generate offsets and zero. I'm sure there are a few others.


That's the scheme. Although probably MSP430 took its inspiration from the one I had in mind since I was thinking of an 8-bit MCU from the early 80s - might've been Motorola.


MSP430?


If the reply link is missing on a comment (it does that after the comments get to a certain depth of nesting) then click on the comment's "link" and you'll get a text box you can reply into.


Not sure why you're telling me this, I suggested MSP430 to the parent post as an answer for the ISA he was looking for.


Ah ok, theatrus2 was the only mention of MSP430 in the discussion under userbinator and in the absence of a "How about " in front of "MSP430?" I assumed you meant "What's a MSP 430?" in reply to theatrus2....but then I should've checked the comment times :)


Do you know a good place for finding out more about the arm64 encoding?


ARM's reference manual (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....). You need to create a (free?) account with ARM to download it. Of course, the document has been out for long enough that I'm sure it's available elsewhere for those truly paranoid about creating accounts.


The Tensilica guys took this thing an extra step, i.e., profile real code to find out which constants are used most often, enumerate the top n constants, and encode a constant as an index 0..n-1 in the immediate field - the immediate value is a hardware-based lookup. You can still do arbitrary immediates with longer instructions, but you can apparently get some nice code size reductions using this technique.


The flip side of that is that the scheme described here takes less silicon and fewer transistors, and thus uses less power, if you happen to optimise the code appropriately.

Code size reductions are good, but for power purposes it is a case of balancing them against decoding complexity.


Holy crap. ARM's encoding makes so much sense... compared to what I have waded through at sandpile.org.


The x86 encoding actually makes sense once you begin to write its instructions using octal instead of hexadecimal numbers: http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt (this isn't mentioned in the Intel or AMD docs).
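
A concrete illustration of why octal helps: the ModRM byte is split 2-3-3, which lines up exactly with three octal digits. The tiny extraction below is generic C, not tied to any particular tool:

    #include <stdio.h>

    int main(void)
    {
        unsigned char modrm = 0x44;              /* 0o104 */
        unsigned mod = (modrm >> 6) & 3;         /* first octal digit:  1 */
        unsigned reg = (modrm >> 3) & 7;         /* second octal digit: 0 */
        unsigned rm  =  modrm       & 7;         /* third octal digit:  4 */
        printf("0x%02x = 0o%03o -> mod=%u reg=%u rm=%u\n",
               modrm, modrm, mod, reg, rm);      /* 0x44 = 0o104 -> mod=1 reg=0 rm=4 */
        return 0;
    }
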


This encoding is nice given that they've already paid the price of having the 32b barrel shifter, but it was a non-obvious choice to have the barrel shifter to begin with. Most instructions don't benefit from the optional rotate, but they pay the price in the encoding and in the data path.

Interestingly, the website uses svg for illustrations, IE 8 and under be damned.


I was left with one question after reading the article: the purpose of the condition field in the instruction.


In the ARM instruction encoding, every arithmetic and logical instruction is "conditional". The destination register is either updated or not depending on the four bit condition field and the state of the condition flags in the processor.

As a simple contrived example, consider the following C code:

    int a[100], b[100], count;
    ...
    for (int i=0; i<100; i++) {
        if (a[i] > b[i]) count++;
    }
without conditional execution, one might compile this to code that uses a branch to either increment count or not; on ARM it would be more idiomatic to use conditional execution. Here's a very literal translation as an example (not tested, apologies for any inadvertent errors):

    // setup: a in R0, b in R1, count in R2, i in R3.
    loop: LDR   R4, [R0, R3, LSL #2] // load a[i]
          LDR   R5, [R1, R3, LSL #2] // load b[i]
          CMP   R4, R5               // if a[i] > b[i]
          ADDGT R2, R2, #1           //     count++
          ADD   R3, R3, #1           // i++
          CMP   R3, #100             // if (i < 100)
          BLT   loop                 //     continue loop
The fourth instruction, ADDGT, is conditional. Count is only updated with the result of the addition if the "greater than" condition is satisfied (the flags were set by the preceding instruction). To be more precise, all of the instructions here are conditional, it's just that for most of them the condition field is 1110, meaning "always".
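
To make the condition-field mechanics concrete, here is a rough sketch of how a few of the sixteen condition codes are evaluated against the N/Z/C/V flags (only a handful shown; 0xE is the "always" code mentioned above):

    /* Sketch: evaluating a 4-bit condition code against the flags. */
    static int cond_passed(unsigned cond, int n, int z, int c, int v)
    {
        (void)c;  /* the carry flag is used by codes not shown (CS, CC, HI, LS) */
        switch (cond) {
        case 0x0: return z;                /* EQ: equal      */
        case 0x1: return !z;               /* NE: not equal  */
        case 0xA: return n == v;           /* GE: signed >=  */
        case 0xB: return n != v;           /* LT: signed <   */
        case 0xC: return !z && (n == v);   /* GT: signed >   */
        case 0xE: return 1;                /* AL: always     */
        default:  return 1;                /* remaining codes omitted */
        }
    }
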

Many instructions also have an "S" bit, which toggles whether or not they update the flags on which conditional execution depends. Taken together, these two features allow a clever assembly programmer to do some really clever things (but historically not too much effort has been directed at getting compilers to make really clever use of these features).

For low-power parts, this is a cute trick, as it allows a programmer to avoid stressing a limited branch predictor with lots of small branches. It does add some complication to the implementation however, especially when you get into designs that retire multiple instructions per cycle or support out-of-order execution, as conditional execution basically adds additional dependencies to every instruction.


I remember there was a "never" condition, which was present just for completeness; it turns out ARM eventually found that having 2^28 different NOPs would not be a good use of opcode space, so it's now a special extension for newer instructions...


I seem to recall from my ARM Assembler coding days that there was also a noop instruction, which of course could be conditional itself, so if you didn't actually want to do the NOOP, you could do NOOP-NE, which wouldn't do anything twice over.


From my days coding ARM assembly on the Acorn Archimedes, NOP was typically an alias for MOV R0,R0 (which effectively did nothing) rather than being its own instruction.


And if you ever needed to manually patch in an easy-to-remember NOP, 0x00000000 was ANDEQ R0, R0, R0.


That's a beautiful explanation. It's one of my favorite things about the ARM instruction set.

That said, it also means debugging becomes a bit more painful. Let's say you want your (cheap) JTAG debugger to halt on the count++ instruction. You can hard break on that particular address in code, but you will always hit that address whether the condition was met or not.


Knowing nothing about the tools, why can't you set a breakpoint based on a bitmask of the instruction pointer?

I'm new to the assembly world, I've been working my way down and have gotten as far as Forth.


Better tools can, thus the (cheap) qualifier on my comment.


This was part of the beauty of ARM when I learned it as a teenager back in the early 90s. Very simple and elegant, and writing ARM code by hand was enjoyable. Coming back to ARM now, though, in this form of Cortex-M microcontrollers, I see that things have become muddied with things like if-then-else instructions and mixed 16-bit/32-bit Thumb-2 code.


Simple, elegant, and it eats an astonishing 12.5% of instruction bandwidth (4 bits out of 32). A branch will require less space as soon as you want to conditionally execute over 8 instructions. On top of that, you pay that cost even for code that is executed unconditionally (in practice, maybe not most of the instructions if you look at the binary, but almost certainly most of the instructions if you count the ones executed multiple times).

So, a neat idea, but not for the long term, and certainly not for all CPUs. For example, ARM 64 ditches this feature (https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-an...)


arm64 ... sort of ditches conditional execution. It’s not on every instruction any more, but it’s still available on more instructions than on most other arches.

To the usual complement of typical conditional instructions (branch, add/sub with carry, select and set), arm64 adds select with increment, negate, or inversion, the ability to conditionally set to -1 as well as +1, and the ability to conditionally compare and merge the flags in a fairly flexible manner (it’s really a conditional select of condition flags between the result of a comparison and an immediate). This actually preserves most of the power of conditional execution (except for really exotic hand-coded usages), while taking up much less encoding space.
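
For a flavour of this, here is a small C sketch of the kind of code that arm64 compilers tend to turn into the conditional-select/compare family (csel, cinc, ccmp) rather than branches; exactly which instructions appear is of course up to the compiler:

    /* Code like this often becomes cmp+csel / ccmp+cinc on arm64. */
    int clamp_nonneg(int x)
    {
        return x < 0 ? 0 : x;                             /* often cmp + csel */
    }

    int count_in_range(int x, int lo, int hi, int count)
    {
        return (x >= lo && x <= hi) ? count + 1 : count;  /* often ccmp + cinc */
    }
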


Your last sentence made me realise why ARM64 no longer has conditional instructions. Obviously it's designed for higher-power situations than most ARM cores, and OOO is an important part of doing that efficiently. Reduced instruction deps == better OOO.


http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

"Almost all ARM instructions can include an optional condition code. An instruction with a condition code is only executed if the condition code flags in the CPSR meet the specified condition. The condition codes that you can use are shown in Table 4.2."

e.g., execute this instruction only if the previous instruction resulted in a negative number


From one of the data sheets:

In ARM state, all instructions are conditionally executed according to the state of the CPSR condition codes and the instruction’s condition field. This field (bits 31:28) determines the circumstances under which an instruction is to be executed. If the state of the C, N, Z and V flags fulfils the conditions encoded by the field, the instruction is executed, otherwise it is ignored.

When a condition is set, the instruction's mnemonic is extended with one of the suffixes EQ, NE, CS, CC, etc.


Yep, consider also this quick example on Wikipedia: http://en.wikipedia.org/wiki/ARM_architecture#Conditional_ex...


Very cool and clever scheme. But what happens to immediates that can't be encoded that way?


There are a few options for producing other values:

1. load them from a constant pool in memory (often this is put nearly inline in the instruction stream so that it can be addressed via a small constant offset from PC). If your code doesn't saturate the LSU locally, this is usually the best option.

2. assemble them via bitwise or arithmetic combinations of literals that you can represent (or already in-register values that you know something about). If you can do it with just a few operations, and the LSU is otherwise occupied, you should do this instead. This is also preferred on some limited cores that can dual-issue logical and arithmetic ops but not LSU ops.

3. in recent revisions of the instruction set, there are a pair of instructions MOVW and MOVT; MOVW conjures a 16b immediate, and MOVT sets the high 16b, so with these two instructions you can conjure any 32b value.
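
For option 3, a rough sketch of the register effect of a MOVW/MOVT pair materializing, say, 0x12345678 (the C just models what the two instructions compute):

    #include <stdint.h>

    /* Models the effect of:
         MOVW r0, #0x5678   @ r0 = 0x00005678 (writes low half, zeroes the rest)
         MOVT r0, #0x1234   @ r0 = 0x12345678 (writes high half, keeps low half) */
    static uint32_t movw_movt(uint16_t low, uint16_t high)
    {
        uint32_t r0 = low;                                /* MOVW */
        r0 = (r0 & 0x0000ffffu) | ((uint32_t)high << 16); /* MOVT */
        return r0;                                        /* e.g. 0x12345678 */
    }
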

There is a bit of subtlety to choosing the correct approach. Some assemblers provide a "conjure this value" pseudo-operation where the assembler will choose what it thinks is the best option to materialize a constant; that way the programmer doesn't need to worry about these details. This looks something like the following:

    LDR R0, =0xff0000ff
the assembler might actually stash the constant somewhere and generate a load instruction, or it might instead do:

    MOV R0,     0xff000000
    ORR R0, R0, 0x000000ff
and assemble the value via a couple instructions with immediates.


Here's what you can do to add a "complicated" constant stored elsewhere, in hand-crafted assembler:

    add_something:            ; function starts here (argument r0 == some number)
        ldr r1, __tmp         ; get complicated constant, store in r1
        add r0, r0, r1        ; do the addition r0 = r0 + r1
        bx lr                 ; == return result (in r0)
    __tmp:
        .word 0x12345678      ; store complicated constant here
You can play with your compiler: if you call gcc as "gcc -Os -S -o- file.c" it will spit out the generated assembler code (-S) on stdout (-o-) for the C code in file.c.

(but then, gcc prefers to have 4 "compact" adds, instead of loading a constant...)

    $ cat dummy.c
    int
    add_random_number(int a)
    {
            return a + 0x12345678; /* guaranteed to be random */
    }

    $ arm-none-eabi-gcc -S -o- -Os dummy.c
    (...)
    add_random_number:
            @ Function supports interworking.
            @ args = 0, pretend = 0, frame = 0
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
            add     r0, r0, #301989888
            add     r0, r0, #3424256
            add     r0, r0, #5696
            add     r0, r0, #56
            bx      lr


In at least the ADT assembler and gas a few years back, the assembler provided syntactic sugar:

    ldr r0,=0x12345678
assembles to

    ldr r0,pc+xxx
    ... and then somewhere later ...
    .word 0x12345678
The assembler had some default places it would put constant pools (end of a module?), or you could explicitly tell it to generate a constant pool if the default place would be outside the limit of the pc-relative addressing mode.


From my experience on x86, GCC's -Os isn't that great - looks like it's the same for ARM.


You have to place them in a constant pool and load them.


As others have said, that's not the case. All constants can be constructed by ORing together a small collection of expressible constants. With a little ingenuity "most" require very few instructions to build. FWIW, I have written real-time radar processing embedded software, and rarely had to resort to stored constants.


Article begins by praising RISC as "elegant" and "a good design decision", goes on to describe limitations of RISC immediate values.


You're being a little disingenuous. It describes the ARM architecture as 'elegant', and 1/4 of the way through the article describes the common use of fixed length instructions in various RISC architectures as 'a good design', and explains why this is mostly a win. And then it explains how the ARM manages to encode a wide range of useful immediate values using a very elegant and simple scheme.

In all my years coding in ARM assembly language, the range of immediate values it supported was rarely if ever an issue.


I agree that this is clever and useful, but I get the feeling that it could have been more so.

I haven't done much assembly in a while, but I was heavily into it once upon a time, and I recall that values with lots of 1s were useful. There is no quick way to generate those here. This means that we can write a single instruction to set any single bit using an inclusive OR and the proper immediate value, but we cannot write a single instruction to clear any single bit.

The reason I think a bit more cleverness might have helped is that there are so many values with multiple encodings. Anything where the 8-bit value ends with 0 has a different encoding as well. For example, a rotation of 0000 and an 8-bit value of 00000100 gives the same result as a rotation of 1111 and an 8-bit value of 00000001 (right?). Perhaps some of the redundant instructions could have been used to represent things ending in lots of 1s?

Regardless, an interesting and informative post. :-)


Clearing an arbitrary bit actually is supported (as mentioned in the article: "you can set, clear, or toggle any bit with one instruction").

The specific details require you to dig a bit past explaining just the immediate encoding, but in the clearing case there's a dedicated instruction for clearing the bit specified by the immediate:

  BIC - Bit Clear (immediate) performs a bitwise AND
  of a register value and the complement of an immediate
  value, and writes the result to the destination register.
As I mentioned in my other post, the otherwise-redundant zero-rotation encodings are put to use as well (to allow byte repetition).


> BIC - Bit Clear (immediate)

Ah, didn't catch that.


You can get trailing ones by subtracting 1:

0x01000000 - 1 = 0x00ffffff

so you can generate any number of trailing ones with 2 instructions.


I love that the author described the ARM as "elegant, pragmatic, and quirky." It reminds me of Gordon Bell's PDP-6/PDP-10 architecture, but applied to the RISC rather than CISC philosophy.

(well, the PDP-10 was pretty RISC for its day and gave us things like BLT, hence bitblit).


So arm compilers must prefer to, for example, XOR with 0x10000000 rather than AND with 0xEFFFFFFF?


AND with 0xefffffff would become BIC ("bit clear", aka "and not") with 0x10000000. But yes, the basic idea of your comment is correct. Fortunately, compilers are quite good at this sort of thing.


And if you need something like ~(byte << N) for any purpose there's still MVN (move-not).

     MVN r0, #0x10000000 ; r0 = 0xefffffff
Oh, the joy :-).


CMN is the real gem. It's an endless source of bugs in the time between when compiler writers discover it and when they figure out how it actually sets flags. ARM would have done well to provide a "here's how you actually use this instruction" guide in the architecture reference manual.


Aside from this kind of trick: I have not worked with a ton of ARM assembly, but what I find is that compilers will very frequently just put values in the text section close to where they are used and refer to them with some PC-relative addressing, rather than using immediate values as often as they might on x86. I have seen this on other RISC platforms as well.


Wouldn't that require the pc-relative address to fit into this immediate scheme?


In order to have a single-instruction PC-relative load, the offset needs to be small (though it uses a different immediate scheme for the offset). So in practice you usually tuck small constant islands between blocks of executable code when you adopt this approach.


That's what I was thinking. I also started wondering how a large structure might be packed, when you can relatively address every byte in the nearest 256 bytes but only every 4 bytes in the next KB and so on. It could sort of upend the way you might normally think about packing a struct.


This is clever. However...

The problem as I see it with this sort of cleverness is that it's difficult to optimize for this. It leads to quite variable best-case and worst-case scenarios and general unpredictability. As, say, a C programmer and not a compiler designer, you might unintentionally pick lots of values that won't fit into the "immediate" scheme (worst-case). Or you might force your design to use numbers that DO fit into this scheme (best-case, but a bleed of lower-level design decisions affecting higher-level design decisions).


These days if what you're writing is in C then you should really know the behaviour of the architecture underneath.

ARM really isn't that hard if you're already thinking like a low level programmer. Things like MIPS were (better) designed for being targeted from higher level languages, but the consequence is a much messier machine language. It's always struck me as amusing that the conventional view is MIPS is minimal, when ARM is really much more so, but it's from outside the Berkeley/Stanford RISC bubble so didn't really get on to their radars for some time.


There is lots of C code that is written to be portable, and trusts the compiler to generate optimized assembly. But those are probably ok with spending a couple of extra cycles for some immediate values.


Why is this cool and clever? It still only encodes 12 bits of data; it is just a different use of those 12 bits than the usual one.

Is this a more useful subset? I am guessing it is so, since they went to this trouble.


It can generate 4096 different slightly useful 32-bit values. For others, you may assemble them using a variety of methods.


It generates fewer than 4096 unique values, as some values map to the same thing. The most obvious being that 0 shifted any number of places produces 0, but also 0x10 shifted left 0 is the same as 0x04 shifted left 2 is the same as 0x01 shifted left 4.


Makes sense. Well... it produces fewer than 4096 interesting 32-bit values.


If I've interpreted this correctly, it means that all values between 256 and 65535 can't be encoded this way since they all have the form

0x0000nn00

with a nonzero nn, and those are the bits that can't be gotten to from rotating.


Anything of the form 0x0000nn00 is representable, as it has at most eight contiguous non-zero bits starting at an even bit position (i.e. these values are 0x000000nn rotated right by 24). Maybe you're wondering how a rotation of 24 can be encoded in four bits? To get the rotation amount, you double the value of the four-bit field. Only even rotation counts are representable.


Thanks, overlooked that detail. Very interesting encoding scheme.


Sure they can, by setting the rotate bits to 0xC (1100). Have a play with the interactive widget near the bottom of the article.

Interestingly, there is some redundancy with this encoding, meaning you can't represent a full 2^12 unique values. E.g., 0x1 could be represented by 0x1 with 0x0 in the rotate field, 0x4 with 0x1 in the rotate field, 0x10 with 0x2, or 0x40 with 0x3. The same applies for every other combination - this means that there are only (2^12)/4 = 1024 unique values that are representable in total.


You didn't do that math right. Out of every 256 values at a particular offset, only a quarter of them are under 64 and can be encoded with rotate+1.

Doing a quick brute force test, it appears that there are 3073 unique values. I suppose that makes sense. Each new rotate introduces 192 new values that have at least one of the new bits set, and 192 * 16 = 3072. Then you have the number 0.
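
For anyone who wants to double-check, a short brute-force enumeration of all 4096 (rotate, imm8) pairs in plain C (nothing ARM-specific) agrees with that count:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        /* Enumerate every (rotate, imm8) pair and count distinct results. */
        uint32_t vals[4096];
        int n = 0;
        for (unsigned rot = 0; rot < 16; rot++)
            for (uint32_t imm8 = 0; imm8 < 256; imm8++) {
                unsigned r = 2 * rot;
                vals[n++] = r ? (imm8 >> r) | (imm8 << (32 - r)) : imm8;
            }
        qsort(vals, n, sizeof vals[0], cmp_u32);
        int unique = 1;
        for (int i = 1; i < n; i++)
            if (vals[i] != vals[i - 1]) unique++;
        printf("%d unique immediates\n", unique);   /* should print 3073 */
        return 0;
    }
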


Nice, but somewhat obsolete I think: AFAIK the ARMv8 (64-bit) ISA is different.


THAT Barrel Shifter, son!



