
ARM immediate value encoding - cornet
http://alisdair.mcdiarmid.org/2014/01/12/arm-immediate-value-encoding.html
======
eckzow
Thumb-2 immediate encoding is even more gleeful--in addition to allowing
rotation, it also allows for spaced repetition of any 8-bit pattern (common in
low level hack patterns, like from [1]) to be encoded in single instructions.

For those interested, check out page 122 of the ARMv7-M architecture reference
manual[2]:

    
    
      // ThumbExpandImm_C()
      // ==================
      (bits(32), bit) ThumbExpandImm_C(bits(12) imm12, bit carry_in)
        if imm12<11:10> == '00' then
          case imm12<9:8> of
            when '00'
              imm32 = ZeroExtend(imm12<7:0>, 32);
            when '01'
              if imm12<7:0> == '00000000' then UNPREDICTABLE;
              imm32 = '00000000' : imm12<7:0> : '00000000' : imm12<7:0>;
            when '10'
              if imm12<7:0> == '00000000' then UNPREDICTABLE;
              imm32 = imm12<7:0> : '00000000' : imm12<7:0> : '00000000';
            when '11'
              if imm12<7:0> == '00000000' then UNPREDICTABLE;
              imm32 = imm12<7:0> : imm12<7:0> : imm12<7:0> : imm12<7:0>;
          carry_out = carry_in;
        else
          unrotated_value = ZeroExtend('1':imm12<6:0>, 32);
          (imm32, carry_out) = ROR_C(unrotated_value, UInt(imm12<11:7>));
        return (imm32, carry_out);
    

[1]
[http://graphics.stanford.edu/~seander/bithacks.html](http://graphics.stanford.edu/~seander/bithacks.html)
(worth a read on its own if you're into this kind of thing)

[2]
[http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readi...](http://web.eecs.umich.edu/~prabal/teaching/eecs373-f10/readings/ARMv7-M_ARM.pdf)
(no-registration link)
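
Read as C, the ThumbExpandImm_C pseudocode boils down to something like the following minimal sketch. It assumes imm12 has already been extracted from the instruction, drops the carry output, and reports the UNPREDICTABLE encodings by returning false; the name `thumb_expand_imm` is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper following ThumbExpandImm_C, minus the carry output.
   Returns false for the encodings the manual calls UNPREDICTABLE. */
static bool thumb_expand_imm(uint32_t imm12, uint32_t *out)
{
    assert(imm12 < 0x1000u);                  /* imm12 is a 12-bit field */
    uint32_t xy = imm12 & 0xFFu;              /* imm12<7:0> */

    if ((imm12 >> 10) == 0) {                 /* imm12<11:10> == '00' */
        uint32_t pattern = (imm12 >> 8) & 3u; /* imm12<9:8> */
        if (pattern != 0 && xy == 0)
            return false;                     /* UNPREDICTABLE */
        switch (pattern) {
        case 0:  *out = xy;               break;  /* 0x000000XY */
        case 1:  *out = xy * 0x00010001u; break;  /* 0x00XY00XY */
        case 2:  *out = xy * 0x01000100u; break;  /* 0xXY00XY00 */
        default: *out = xy * 0x01010101u; break;  /* 0xXYXYXYXY */
        }
        return true;
    }

    /* '1':imm12<6:0> rotated right by imm12<11:7>; that amount is 8..31 here,
       so neither shift below is by 0 or 32. */
    uint32_t unrotated = 0x80u | (imm12 & 0x7Fu);
    uint32_t rot = imm12 >> 7;
    *out = (unrotated >> rot) | (unrotated << (32u - rot));
    return true;
}
```

For example, imm12 = 0x3AB expands to 0xABABABAB, the kind of repeated-byte mask that shows up in the bit hacks from [1].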

------
stephencanon
The set of representable ARM immediates is really nice. It's wonderfully
useful for writing soft-float and math library routines, where you have very
common values with just some high-order bits set:

    
    
        0x3f800000 // encoding of 1.0f
    

The set of immediate encodings, together with "shifts for free on most
operations" (which are closely related features, as the OP points out), went a
long way toward preserving my sanity when writing assembly.
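
A small C check of the 1.0f example, as a sketch that assumes `float` is IEEE-754 single precision: the bit pattern 0x3f800000 is the 8-bit value 0xFE rotated right by 10, i.e. a rotate field of 5. `arm_decode_imm` and `float_bits` are illustrative names, not library functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* Decode an ARM-mode data-processing immediate: imm8 rotated right by
   twice the 4-bit rotate field (so only even rotations exist). */
static uint32_t arm_decode_imm(uint32_t rot4, uint32_t imm8)
{
    assert(rot4 < 16 && imm8 < 256);
    uint32_t n = 2 * rot4;
    /* "& 31" keeps the left shift in range when n == 0 */
    return (imm8 >> n) | (imm8 << ((32u - n) & 31u));
}

/* Raw bit pattern of a float (assumes IEEE-754 binary32). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

0.5f (0x3f000000, i.e. 0x3F rotated right by 8) and 2.0f (0x40000000, i.e. 0x01 rotated right by 2) are encodable the same way.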

Worth noting: thumb-2 immediates have a different (and even more interesting)
encoding scheme. arm64 immediates are pretty interesting too (there the set of
representable immediates is different depending on the instruction domain).

~~~
marcosdumay
Honestly, with so few bits, I was expecting it to be a lookup table. (Yep,
I've never written ARM assembly.)

But then, this way you have a nice set of immediates (as you said), and can set
any value at all with at most four instructions in the rare case you need
something different.

~~~
userbinator
> I was expecting it to be a lookup table.

There's one ISA I've worked with before that does have -2, -1, 0, 1, 2, and
some other "commonly used constants" like powers of 2 encoded specially in the
immediate. I think it was an 8-bit, but I can't remember exactly which one.
Anyone know what I'm referring to?

~~~
theatrus2
The MSP430 has constant generator registers (R2 and R3) which, depending on
the addressing mode used, produce the constants 0, 1, 2, 4, 8, and -1. I'm sure
there are a few others.

~~~
userbinator
That's the scheme. Although probably MSP430 took its inspiration from the one
I had in mind since I was thinking of an 8-bit MCU from the early 80s -
might've been Motorola.
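
The MSP430 scheme can be sketched as a small decoder. This follows the constant generator table in TI's MSP430 documentation as I understand it (R3 yields 0, 1, 2 or -1 and R2 yields 4 or 8, selected by the 2-bit source addressing mode As); the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: if source register `reg` with source addressing mode `as`
   selects the MSP430 constant generator, store the constant and return true. */
static bool msp430_cg_constant(unsigned reg, unsigned as, uint16_t *value)
{
    static const uint16_t cg2[4] = { 0x0000, 0x0001, 0x0002, 0xFFFF }; /* R3, As = 00..11 */
    assert(as < 4);                        /* As is a 2-bit field */
    if (reg == 3) {                        /* R3 (CG2): every mode yields a constant */
        *value = cg2[as];
        return true;
    }
    if (reg == 2 && as >= 2) {             /* R2 (CG1): As = 10 -> 4, As = 11 -> 8 */
        *value = (as == 2) ? 4 : 8;
        return true;
    }
    /* R2 with As = 00/01 is register/absolute mode; other registers are normal operands */
    return false;
}
```

This is why an MSP430 instruction using one of these six constants needs no extra word to hold it.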

------
deadsy
The Tensilica guys took this idea a step further: profile real code to find
out which constants are most commonly used, enumerate the top n constants, and
encode the constant as 0..n-1 in the immediate field of the instruction, so
the immediate value is a hardware-based lookup. You can still do arbitrary
immediates with longer instructions, but you can apparently get some nice code
size reductions using this technique.

~~~
fidotron
The flipside of that is that the scheme described here will take less silicon,
fewer transistors and thus . . . use less power, if you happen to optimise the
code appropriately.

Code size reductions are good, but for power purposes it is a case of
balancing them against decoding complexity.
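
The table-lookup idea fits in a few lines of C. The 16 constants below are invented for illustration (they are not Tensilica's actual table); in the approach described above they would come from profiling real code.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative table of "popular" constants; a real design would derive it from profiling. */
static const int32_t kConstTable[16] = {
    0, 1, 2, 3, 4, 8, 16, 32, 64, 128, 255, 256, -1, -2, -4, -8
};

/* Decoder side: the 4-bit immediate field is just an index into the table. */
static int32_t decode_table_imm(unsigned field)
{
    assert(field < 16);
    return kConstTable[field];
}

/* Assembler side: find the field value for a constant, or return -1 so the
   caller can fall back to a longer instruction. */
static int encode_table_imm(int32_t value)
{
    for (int i = 0; i < 16; i++)
        if (kConstTable[i] == value)
            return i;
    return -1;
}
```

Here the decoder needs a small table where the rotate scheme needs a rotator, which is the silicon trade-off being discussed.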

------
chewxy
Holy crap. ARM's encoding makes so much sense... compared to what I have waded
through at sandpile.org.

~~~
wolfgke
The x86 encoding actually makes sense once you begin to write its instructions
using octal instead of hexadecimal numbers:
[http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt](http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt)
(this isn't mentioned in the Intel or AMD docs).
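
The octal view works because the x86 ModRM byte is laid out as mod (2 bits), reg (3 bits) and r/m (3 bits), so its three octal digits are exactly those fields. A minimal sketch (the function names are illustrative):

```c
#include <assert.h>

/* Fields of an x86 ModRM byte: mod (bits 7-6), reg (bits 5-3), r/m (bits 2-0). */
struct modrm { unsigned mod, reg, rm; };

static struct modrm split_modrm(unsigned char byte)
{
    struct modrm m = { (byte >> 6) & 3u, (byte >> 3) & 7u, byte & 7u };
    return m;
}

/* Inverse: build the byte from its fields; in octal this is just "mod reg rm". */
static unsigned char join_modrm(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (unsigned char)(mod * 0100 + reg * 010 + rm);
}
```

For example, `mov eax, ebx` assembles to 89 D8; in octal D8 is 330, i.e. mod 3 (register operand), reg 3 (EBX), r/m 0 (EAX).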

------
tasty_freeze
This encoding is nice given that they've already paid the price of having the
32b barrel shifter, but it was a non-obvious choice to have the barrel shifter
to begin with. Most instructions don't benefit from the optional rotate, but
they pay the price in the encoding and in the data path.

Interestingly, the website uses svg for illustrations, IE 8 and under be
damned.

------
wreegab
I was left with one question after reading the article: the purpose of the
condition field in the instruction.

~~~
stephencanon
In the ARM instruction encoding, _every_ arithmetic and logical instruction is
"conditional". The destination register is either updated or not depending on
the four bit condition field and the state of the condition flags in the
processor.

As a simple contrived example, consider the following C code:

    
    
        int a[100], b[100], count;
        ...
        for (int i=0; i<100; i++) {
            if (a[i] > b[i]) count++;
        }
    

Without conditional execution, one might compile this to code that uses a
branch to either increment count or not; on ARM it would be more idiomatic to
use conditional execution. Here's a very literal translation as an example
(not tested, apologies for any inadvertent errors):

    
    
        // setup: a in R0, b in R1, count in R2, i in R3.
        loop: LDR   R4, [R0, R3, LSL #2] // load a[i]
              LDR   R5, [R1, R3, LSL #2] // load b[i]
              CMP   R4, R5               // if a[i] > b[i]
              ADDGT R2, R2, #1           //     count++
              ADD   R3, R3, #1           // i++
              CMP   R3, #100             // if (i < 100)
              BLT   loop                 //     continue loop
    

The fourth instruction, ADDGT, is _conditional_. Count is only updated with
the result of the addition if the "greater than" condition is satisfied (the
flags were set by the preceding instruction). To be more precise, all of the
instructions here are conditional, it's just that for most of them the
condition field is 1110, meaning "always".
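
To make the condition field concrete, here is a C sketch that models what the hardware does for the loop above: compute the NZCV flags the way CMP does, then test the 4-bit condition before letting the add take effect. It is a simplified model for illustration (the names are made up), not how a compiler would emit code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* NZCV condition flags. */
struct flags { bool n, z, c, v; };

/* Flags as set by CMP a, b, which computes a - b and discards the result. */
static struct flags cmp_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    struct flags f;
    f.n = (r >> 31) != 0;                    /* result negative */
    f.z = (r == 0);                          /* result zero */
    f.c = (a >= b);                          /* no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;  /* signed overflow */
    return f;
}

/* Does the 4-bit condition field pass for these flags? */
static bool cond_passed(unsigned cond, struct flags f)
{
    assert(cond < 16);
    switch (cond) {
    case 0x0: return f.z;                    /* EQ */
    case 0x1: return !f.z;                   /* NE */
    case 0x2: return f.c;                    /* CS/HS */
    case 0x3: return !f.c;                   /* CC/LO */
    case 0x4: return f.n;                    /* MI */
    case 0x5: return !f.n;                   /* PL */
    case 0x6: return f.v;                    /* VS */
    case 0x7: return !f.v;                   /* VC */
    case 0x8: return f.c && !f.z;            /* HI */
    case 0x9: return !f.c || f.z;            /* LS */
    case 0xA: return f.n == f.v;             /* GE */
    case 0xB: return f.n != f.v;             /* LT */
    case 0xC: return !f.z && f.n == f.v;     /* GT */
    case 0xD: return f.z || f.n != f.v;      /* LE */
    case 0xE: return true;                   /* AL, "always" */
    default:  return false;                  /* 1111: originally NV, "never" */
    }
}

/* The loop above, modelling ADDGT as "the add only takes effect if GT passes". */
static int count_greater(const int32_t *a, const int32_t *b, int len)
{
    int count = 0;
    for (int i = 0; i < len; i++) {
        struct flags f = cmp_flags((uint32_t)a[i], (uint32_t)b[i]); /* CMP R4, R5 */
        if (cond_passed(0xC, f))                                    /* ...GT */
            count += 1;                                             /* ADD R2, R2, #1 */
    }
    return count;
}
```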

Many instructions also have an "S" bit, which toggles whether or not they
update the flags on which conditional execution depends. Taken together, these
two features allow a skilled assembly programmer to do some really clever
things (but historically not too much effort has been directed at getting
compilers to make really clever use of these features).

For low-power parts, this is a cute trick, as it allows a programmer to avoid
stressing a limited branch predictor with lots of small branches. It does add
some complication to the implementation however, especially when you get into
designs that retire multiple instructions per cycle or support out-of-order
execution, as conditional execution basically adds additional dependencies to
every instruction.

~~~
userbinator
I remember there was a "never" condition, which was present just for
completeness; it turns out ARM eventually found that having 2^28 different
NOPs would not be a good use of opcode space, so it's now a special extension
for newer instructions...

~~~
kybernetikos
I seem to recall from my ARM Assembler coding days that there was _also_ a
noop instruction, which of course could be conditional itself, so if you
didn't actually want to do the NOOP, you could do NOOP-NE, which wouldn't do
anything twice over.

~~~
talideon
From my days coding ARM assembly on the Acorn Archimedes, NOP was typically an
alias for MOV R0,R0 (which effectively did nothing) rather than being its own
instruction.

~~~
danellis
And if you ever needed to manually patch in an easy-to-remember NOP,
0x00000000 was ANDEQ R0, R0, R0.
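
Both of those NOPs are ordinary data-processing instructions, which a small decoder for that instruction class makes visible. This is a sketch that only splits out the fields (cond, I, opcode, S, Rn, Rd, operand2) and builds the old-style mnemonic; it ignores the other instructions that share this part of the encoding space.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A32 data-processing layout:
   cond[31:28] 00 I[25] opcode[24:21] S[20] Rn[19:16] Rd[15:12] operand2[11:0] */
struct dp_insn { unsigned cond, i, opcode, s, rn, rd, operand2; };

static const char *const kCond[16] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "", "NV"   /* AL (1110) is left implicit */
};
static const char *const kOpcode[16] = {
    "AND", "EOR", "SUB", "RSB", "ADD", "ADC", "SBC", "RSC",
    "TST", "TEQ", "CMP", "CMN", "ORR", "MOV", "BIC", "MVN"
};

static struct dp_insn decode_dp(uint32_t insn)
{
    assert(((insn >> 26) & 3u) == 0);   /* bits 27:26 == 00: data-processing class */
    struct dp_insn d = {
        insn >> 28, (insn >> 25) & 1u, (insn >> 21) & 15u, (insn >> 20) & 1u,
        (insn >> 16) & 15u, (insn >> 12) & 15u, insn & 0xFFFu
    };
    return d;
}

/* Pre-UAL mnemonic: opcode, then condition, then "S" (e.g. "ANDEQ", "ADDGT").
   TST/TEQ/CMP/CMN always set flags, so no "S" is printed for them.
   out must have room for 8 bytes. */
static void dp_mnemonic(struct dp_insn d, char out[8])
{
    strcpy(out, kOpcode[d.opcode]);
    strcat(out, kCond[d.cond]);
    if (d.s && (d.opcode < 8 || d.opcode > 11))
        strcat(out, "S");
}
```

MOV R0,R0 is 0xE1A00000: condition 1110 (AL), opcode 1101 (MOV), operand2 = R0 with no shift.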

------
AshleysBrain
Very cool and clever scheme. But what happens to immediates that can't be
encoded that way?

~~~
cnvogel
Here's what you can do to add a "complicated" constant stored elsewhere, in
hand-crafted assembler:

    
    
        add_something:            @ function starts here (argument r0 == some number)
                ldr r1, __tmp     @ PC-relative load of the complicated constant into r1
                add r0, r0, r1    @ do the addition r0 = r0 + r1
                bx  lr            @ return (result in r0)
        __tmp:
                .word 0x12345678  @ the complicated constant is stored here
    

You can play with your compiler: if you call gcc as "gcc -Os -S -o- file.c",
it will write the generated assembler code (-S) to stdout (-o-) for the C code
in file.c.

(but then, gcc prefers to have 4 "compact" adds, instead of loading a
constant...)

    
    
        $ cat dummy.c
        int
        add_random_number(int a)
        {
                return a + 0x12345678; /* guaranteed to be random */
        }
    
        $ arm-none-eabi-gcc -S -o- -Os dummy.c
        (...)
        add_random_number:
                @ Function supports interworking.
                @ args = 0, pretend = 0, frame = 0
                @ frame_needed = 0, uses_anonymous_args = 0
                @ link register save eliminated.
                add     r0, r0, #301989888
                add     r0, r0, #3424256
                add     r0, r0, #5696
                add     r0, r0, #56
                bx      lr
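
The four adds work because any 32-bit constant can be covered by at most four 8-bit windows at even bit positions. Here is a simple greedy splitter in C as a sketch of that idea; it is not gcc's actual algorithm (gcc picked 0x12000000, 0x344000, 0x1640 and 0x38 above, while this one starts from the lowest set bit and picks different pieces).

```c
#include <assert.h>
#include <stdint.h>

/* Split v into pieces that are each a valid ARM-mode immediate (8 bits starting
   at an even bit position) and whose sum (equivalently, bitwise OR) is v.
   Greedy from the lowest set bit. Each window starts at least 8 bits above the
   previous one, so at most four are ever needed. Returns the number of pieces. */
static int split_arm_immediates(uint32_t v, uint32_t pieces[4])
{
    int n = 0;
    while (v != 0) {
        unsigned start = 0;
        while (((v >> start) & 1u) == 0)
            start++;                                    /* lowest set bit */
        start &= ~1u;                                   /* round down to an even position */
        uint32_t piece = v & ((uint32_t)0xFF << start); /* bits above 31 simply fall off */
        assert(n < 4);
        pieces[n++] = piece;
        v &= ~piece;                                    /* these bits are now covered */
    }
    return n;
}
```

For 0x12345678 it yields 0x278, 0x5400, 0x2340000 and 0x10000000, again four pieces, each usable as the immediate of one `add`.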

~~~
yuubi
In at least the ADT assembler and gas a few years back, the assembler provided
syntactic sugar:

    
    
        ldr r0,=0x12345678
    

assembles to

    
    
        ldr r0,[pc,#xxx]
        ... and then somewhere later ...
        .word 0x12345678
    

The assembler had some default places it would put constant pools (end of a
module?), or you could explicitly tell it to generate a constant pool (LTORG in
armasm, .ltorg in gas) if the default place would be outside the range of the
pc-relative addressing mode (±4095 bytes for LDR).

------
thrownaway2424
Article begins by praising RISC as "elegant" and "a good design decision",
goes on to describe limitations of RISC immediate values.

~~~
talideon
You're being a little disingenuous. It describes the _ARM_ architecture as
'elegant', and 1/4 of the way through the article describes the common use of
fixed length instructions in various RISC architectures as 'a good design',
and explains why this is mostly a win. And _then_ it explains how the ARM
manages to encode a wide range of useful immediate values using a very elegant
and simple scheme.

In all my years coding in ARM assembly language, the range of immediate values
it supported was rarely if ever an issue.

------
ggchappell
I agree that this is clever and useful, but I get the feeling that it could
have been more so.

I haven't done much assembly in a while, but I was heavily into it once upon a
time, and I recall that values with lots of 1s were useful. There is no quick
way to generate those here. This means that we can write a single instruction
to set any single bit using an inclusive OR and the proper immediate value,
but we cannot write a single instruction to _clear_ any single bit.

The reason I think a bit more cleverness might have helped is that there are
so many values with multiple encodings. Anything where the 8-bit value ends
with two 0 bits has a different encoding as well, since rotations come in
steps of two. For example, a rotation of 0000 and an 8-bit value of 00000100
gives the same result as a rotation of 1111 and an 8-bit value of 00000001
(right?). Perhaps some of the redundant encodings could have been used to
represent values ending in lots of 1s?

Regardless, an interesting and informative post. :-)

~~~
eckzow
Clearing an arbitrary bit actually _is_ supported (as mentioned in the
article: "you can set, clear, or toggle any bit with one instruction").

The specific details require you to dig a bit past explaining just the
immediate encoding, but in the clearing case there's a dedicated instruction
for clearing the bit specified by the immediate:

    
    
      BIC - Bit Clear (immediate) performs a bitwise AND
      of a register value and the complement of an immediate
      value, and writes the result to the destination register.
    

As I mentioned in my other post, zero-rotation encodings _are_ gamed out as
well (to allow byte repetition).

~~~
ggchappell
> BIC - Bit Clear (immediate)

Ah, didn't catch that.

------
gumby
I love that the author described the ARM as "elegant, pragmatic, and quirky."
It reminds me of Gordon Bell's PDP-6/PDP-10 architecture, but applied to the
RISC rather than CISC philosophy.

(well, the PDP-10 was pretty RISC for its day and gave us things like BLT,
hence bitblit).

------
sbanach
So arm compilers must prefer to, for example, XOR with 0x10000000 rather than
AND with 0xEFFFFFFF?

~~~
stephencanon
AND with 0xefffffff would become BIC ("bit clear", aka "and not") with
0x10000000. But yes, the basic idea of your comment is correct. Fortunately,
compilers are quite good at this sort of thing.

~~~
cnvogel
And if you need something like ~(byte << N) for any purpose there's still MVN
(move-not).

    
    
         MVN r0, #0x10000000 ; r0 = 0xefffffff
    

Oh, the joy :-).
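
The choice between the plain and the inverted forms can be sketched like this: try the constant as an immediate, and if that fails, try its complement (MOV becomes MVN, AND becomes BIC). This illustrates only that simple rule, with made-up names; real compilers consider more options.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is an ARM-mode immediate: an 8-bit value rotated right by an even amount. */
static bool is_arm_imm(uint32_t v)
{
    for (unsigned r = 0; r < 32; r += 2) {
        /* rotating left by r undoes a rotate right by r; "& 31" avoids a shift by 32 */
        uint32_t x = (v << r) | (v >> ((32u - r) & 31u));
        if (x <= 0xFFu)
            return true;
    }
    return false;
}

enum imm_form {
    IMM_DIRECT,      /* MOV rd, #v   /  AND rd, rn, #v  */
    IMM_COMPLEMENT,  /* MVN rd, #~v  /  BIC rd, rn, #~v */
    IMM_NONE         /* needs a literal pool load or several instructions */
};

static enum imm_form choose_form(uint32_t v)
{
    if (is_arm_imm(v))
        return IMM_DIRECT;
    if (is_arm_imm(~v))
        return IMM_COMPLEMENT;
    return IMM_NONE;
}
```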

~~~
stephencanon
CMN is the real gem. It's an endless source of bugs in the time between when
compiler writers discover it and when they figure out how it actually sets
flags. ARM would have done well to provide a "here's how you actually use this
instruction" guide in the architecture reference manual.
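
The comment doesn't say which misunderstanding it means. One plausible trap (an assumption here) is treating CMN rn, #k as interchangeable with CMP rn, #-k: they compute the same result, but the flags can differ. A C sketch that computes the NZCV flags both ways:

```c
#include <stdbool.h>
#include <stdint.h>

struct nzcv { bool n, z, c, v; };

/* Flags from ADDS, i.e. what CMN a, b computes (a + b). */
static struct nzcv adds_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;
    /* carry: the unsigned sum wrapped; overflow: same-sign operands, different-sign result */
    struct nzcv f = { r >> 31, r == 0, r < a, ((~(a ^ b) & (a ^ r)) >> 31) != 0 };
    return f;
}

/* Flags from SUBS, i.e. what CMP a, b computes (a - b). */
static struct nzcv subs_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    /* carry: no borrow; overflow: different-sign operands, result sign differs from a */
    struct nzcv f = { r >> 31, r == 0, a >= b, (((a ^ b) & (a ^ r)) >> 31) != 0 };
    return f;
}

static bool same_flags(struct nzcv x, struct nzcv y)
{
    return x.n == y.n && x.z == y.z && x.c == y.c && x.v == y.v;
}
```

The two agree for every k except 0 (the carry differs) and 0x80000000 (the overflow flag differs).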

------
lectrick
This is clever. However...

The problem as I see it with this sort of cleverness is that it's difficult to
optimize to this. It leads to quite variable best-case and worst-case
scenarios and general unpredictability. As, say, a C programmer and not a
compiler designer, you might unintentionally pick lots of values that won't
fit into the "immediate" scheme (worst-case). Or you might force your design
to use numbers that DO fit into this scheme (best-case, but a bleed of lower-
level design decisions affecting higher-level design decisions).

~~~
fidotron
These days if what you're writing is in C then you should really know the
behaviour of the architecture underneath.

ARM really isn't that hard if you're already thinking like a low level
programmer. Things like MIPS were (better) designed for being targeted from
higher level languages, but the consequence is a much messier machine
language. It's always struck me as amusing that the conventional view is MIPS
is minimal, when ARM is really much more so, but it's from outside the
Berkeley/Stanford RISC bubble so didn't really get on to their radars for some
time.

~~~
TorKlingberg
There is lots of C code that is written to be portable, and trusts the
compiler to generate optimized assembly. But those are probably ok with
spending a couple of extra cycles for some immediate values.

------
caprad
Why is this cool and clever? It still only encodes 12 bits of data, it is just
different to using the normal 12 bits of data.

Is this a more useful subset? I am guessing it is so, since they went to this
trouble.

~~~
rbanffy
It can generate 4096 different slightly useful 32-bit values. For others, you
may assemble them using a variety of methods.

~~~
tasty_freeze
It generates fewer than 4096 unique values, as some values map to the same
thing. The most obvious being that 0 shifted any number of places produces 0,
but also 0x10 shifted left 0 is the same as 0x04 shifted left 2 is the same as
0x01 shifted left 4.

~~~
rbanffy
Makes sense. Well... It produces less than 4096 interesting 32-bit values.
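
The exact number is easy to check by brute force. A small C sketch that enumerates every rotate/imm8 pair and counts distinct values; by my count it comes to 3073, i.e. zero plus 192 × 16 nonzero values (a nonzero 8-bit value whose low two bits are both 0 always has a second encoding with one rotation step less).

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Enumerate all 16 x 256 rotate/imm8 pairs and count the distinct 32-bit values. */
static int count_distinct_arm_immediates(void)
{
    static uint32_t values[16 * 256];
    int n = 0;
    for (uint32_t rot = 0; rot < 16; rot++) {
        uint32_t r = 2 * rot;
        for (uint32_t imm8 = 0; imm8 < 256; imm8++)
            /* rotate right by r; "& 31" avoids a shift by 32 when r == 0 */
            values[n++] = (imm8 >> r) | (imm8 << ((32u - r) & 31u));
    }
    qsort(values, (size_t)n, sizeof values[0], cmp_u32);
    int distinct = 0;
    for (int i = 0; i < n; i++)
        if (i == 0 || values[i] != values[i - 1])
            distinct++;
    return distinct;
}
```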

------
userbinator
If I've interpreted this correctly, it means that all values between 256 and
65535 can't be encoded this way since they all have the form

0x0000nn00

with a nonzero nn, and those are the bits that can't be gotten to from
rotating.

~~~
stephencanon
Anything of the form 0x0000nn00 is representable, as it has at most eight
contiguous non-zero bits starting at an even bit position (i.e. these values
are 0x000000nn rotated right by 24). Maybe you're wondering how a rotation of
24 can be encoded in four bits? To get the rotation amount, you _double_ the
value of the four-bit field. Only even rotation counts are representable.
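
Going the other way, an assembler has to search for a rotation. A minimal sketch of that search (an illustration, not any particular assembler's code): for 0x0000AB00 it finds imm8 = 0xAB with a rotate field of 12, i.e. rotate right by 24.

```c
#include <stdbool.h>
#include <stdint.h>

/* Find an ARM-mode immediate encoding for v: on success, v equals imm8
   rotated right by 2 * rot4. Returns false if v is not representable. */
static bool arm_encode_imm(uint32_t v, uint32_t *rot4, uint32_t *imm8)
{
    for (uint32_t r = 0; r < 16; r++) {
        uint32_t n = 2 * r;  /* the field is doubled: only even rotations */
        /* rotating v left by n undoes a rotate right by n; "& 31" avoids a shift by 32 */
        uint32_t x = (v << n) | (v >> ((32u - n) & 31u));
        if (x <= 0xFFu) {
            *rot4 = r;
            *imm8 = x;
            return true;
        }
    }
    return false;
}
```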

~~~
userbinator
Thanks, overlooked that detail. Very interesting encoding scheme.

------
renox
Nice but somewhat obsolete I think: AFAIK the ARMv8 (64-bit) ISA is different.

------
notinreallife
THAT Barrel Shifter, son!

