
The Art of Picking Intel Registers (2003) - nkurz
http://www.swansontec.com/sregisters.html
======
wladimir
Oh this fills me with a mix of nostalgia and dread. The first time I saw a
'flat' instruction set such as MIPS and ARM, with a large file of registers
that are treated the same, was really refreshing.

~~~
RodgerTheGreat
I'd imagine that examining an efficient stack architecture like the J1[1]
would be even more refreshing. No need for register allocation and tremendous
gains in code density. You lose the ability to do out-of-order execution, but
in many situations independent operations can still be combined into a single
clock cycle, and the CPU is still faster because it simply has less work to
do.

[1] <http://www.excamera.com/sphinx/fpga-j1.html>

~~~
johntb86
Isn't that even worse? Now, you have to reorder your instructions to make them
more efficient and avoid exchange instructions.

~~~
snogglethorpe
Some stack architectures (e.g. Hobbit) use small offsets to allow addressing
into the stack, which adds some of the flavor and flexibility of a register-
based machine while still conceptually keeping all values on the stack.

------
bitL
In GCC you can actually offload the decision about which registers to choose
if you are writing inline assembly code. You just specify what type of
registers you need and GCC optimizes the assignments for you. Then you
reference them with % notation - e.g. movd %1, %2. Imagine you provide this as
an inline function in C++; all the calls for simple functions like arithmetic
operations will be optimized on the spot instead of building stacks or moving
some value to/from eax... It sped up my 3D software renderer more than twice!
:)
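
A minimal sketch of what that can look like, assuming GCC or Clang on x86 (the function name `add_u32` is made up for illustration, and the plain-C fallback is only there so it builds elsewhere). The "r" constraints ask the compiler for any general-purpose register, and the template refers to them by number:

```c
#include <stdint.h>

/* Hypothetical helper: adds two 32-bit values with one ADD instruction.
   "=r" lets the compiler pick the output register, "0" puts `a` in that
   same register, and "r" lets it pick any register for `b`. */
static inline uint32_t add_u32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t result;
    __asm__("addl %2, %0"        /* AT&T syntax: %0 = %0 + %2 */
            : "=r"(result)       /* operand 0: output, any register */
            : "0"(a), "r"(b)     /* operand 1 shares operand 0's register */
            : "cc");             /* ADD modifies the flags */
    return result;
#else
    return a + b;                /* portable fallback for non-x86 builds */
#endif
}
```

Because the function is inline and no register is named, the compiler can fold it into the caller and keep `a` and `b` wherever they already live.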

~~~
pbsd
You can also do the reverse. By adding __asm__("%ecx") to a variable
declaration, you assign it to that register, while being able to perform C
code as usual.

This usually hurts performance, but it has its uses.
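
A rough sketch of the pinned-variable form, again assuming GCC or Clang on x86 (`shift_left_cl` is a made-up name, and the fallback branch is only for other targets). Variable shifts need the count in CL, so pinning the variable to ECX is one case where this is useful:

```c
#include <stdint.h>

/* Hypothetical helper: shifts `value` left by `count` bits using SHL,
   which takes a variable shift count only in the CL register. */
static inline uint32_t shift_left_cl(uint32_t value, uint32_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* Pin this local variable to ECX; it is still an ordinary C variable. */
    register uint32_t cnt __asm__("ecx") = count;
    __asm__("shll %%cl, %0"      /* AT&T syntax: %0 <<= CL */
            : "+r"(value)        /* read and written, any register */
            : "r"(cnt)           /* already lives in ECX */
            : "cc");             /* SHL modifies the flags */
    return value;
#else
    return value << (count & 31u);  /* x86 masks the count to 5 bits */
#endif
}
```

In everyday code the "c" constraint does the same job without a pinned variable; the explicit register form matters more for registers that have no constraint letter.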

------
csense
If you're programming 16-bit assembly language, it's even more important to
keep in mind the advantages and restrictions of each register, because you can
only use BX, SI, DI, and BP for memory addressing.

Each register has its own personality and certain things it just does best. It
is a little sad to see that RISC has "won" and modern CPUs have dozens or
hundreds of registers which are just numbers.

~~~
marssaxman
How on earth is that _sad_? It's a great thing that we are finally leaving all
that old cruft in the past!

------
gchpaco
For reasons like this I usually claim that 32-bit x86 chips have _zero_ or at
most one general purpose registers. Everything else is full of weird junk like
this.

~~~
notaddicted
Also, there is the conceptual model of how an x86 computer works, and then
there is how it actually works. For example, I think the Sandy Bridge
architecture has 160 physical integer registers, if I'm interpreting this
correctly:
<http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/3>

~~~
wtallis
Right, but the oddities and general paucity of the architectural registers
make it hard to actually put the large physical register file to use. Subtle
differences in instruction selection and ordering can make a big difference in
performance, e.g.
<http://stackoverflow.com/questions/15349308/using-xmm0-register-and-memory-fetches-c-code-is-twice-as-fast-as-asm-only-u/15349403#15349403>

~~~
ajross
That's an interesting example, but it doesn't have anything to do with
"putting the physical register file to use". Register renaming always helps as
long as your algorithm is dependency-free. It's probably one of the single
_best_ general purpose optimizations available to a modern CPU design.

The classic example of where register-poor ISAs hurt isn't about "subtle
differences" at all. It's the fact that there are still only 8 (for i386)
named registers, and so anything that needs to deal with a working set beyond
that needs to do spill/fill to memory, and you can't "rename" memory accesses
(though you can sort of cheat, as with the store forward optimization -- but
that doesn't work nearly so well as renaming does).

~~~
gchpaco
The conventional x86 way of doing it was to spill to "the stack" and I'm told
that their chips actually specially optimize that nowadays. But it's all very
weird, deep magic in a lot of ways.

~~~
brigade
The only stack-specific special optimization that's done is fusing the
decrement/increment of esp/rsp with the store µop. And that's done mainly
since push/pop are one byte opcodes, unlike general load/store.

Everything else is general memory optimizations that apply for everything like
the aforementioned store forwarding. It's still expensive if the CPU can't use
them (mismatched load/store size, incorrect speculation, etc.)

------
jejones3141
"As a review, all x86-family CPU's have 8 general-purpose registers."

Having participated in the writing of an x86 code generator for a compiler, I
wish to thank the author for a much-needed laugh from the above sentence.

------
pmelendez
This article is very interesting. The one thing I would love to see in it is
an example of how using the registers according to their original design leads
to more compact code than using the registers freely.

~~~
picomancer
The author gives an example; you just have to fill in some blanks. To follow
along, on Linux you can use the NASM assembler (sudo apt-get install nasm on
Debian-like or Ubuntu-like systems).

Then you can copy-paste the assembly language from the article (I've changed
the spacing for readability and added dummy definitions for the names so it
will compile):

    
    
        ;demo1.asm
        source_address       equ 0x100
        destination_address  equ 0x200
        loop_count           equ 0x10
    
            mov esi, source_address
            mov edi, destination_address
            mov ecx, loop_count
        my_loop:
            lodsd
            ;Do some calculations with eax here.
            stosd
            loop my_loop
    

To assemble it, type this in your favorite shell:

    
    
        nasm -l demo1.lst demo1.asm
    

Then here's my alternative implementation that uses different registers. You
can no longer use the LODSD, STOSD or LOOP instructions, since they are
hard-wired to the registers used in demo1.

    
    
        ;demo2.asm
        source_address       equ 0x100
        destination_address  equ 0x200
        loop_count           equ 0x10
    
            mov ebx, source_address
            mov edx, destination_address
            mov esi, loop_count
        my_loop:
            mov eax,[ebx]       ; these two instructions instead of lodsd
            add ebx,4
            ;Do some calculations with eax here.
            mov [edx],eax       ; these two instructions instead of stosd
            add edx,4
            dec esi             ; these two instructions instead of loop
            jnz my_loop
    

I get a demo1 of 24 bytes and a demo2 of 38 bytes. The demo1.lst and demo2.lst
files produced show how many bytes are taken up by each instruction. (And if
you get addresses in a crash dump, they can be used to track down the
corresponding source code line.)

If you want to actually _run_ these programs, nasm's default output (raw
machine language instructions) cannot be used by most OS's. (In DOS, you can
-- just rename to .COM. NASM's raw binary output already defaults to 'bits
16', which is why it emits the proper prefixes for those new-fangled 32-bit
instructions, but the program will crash without the DOS exit syscall, INT
0x20.) The magic incantations for standalone Linux assembly language programs
are here:

<http://blog.markloiseau.com/2012/04/hello-world-nasm-linux/>

The .o file produced by an intermediate step of the instructions at the above
link can be linked with C code (unless you use a Microsoft toolchain, in which
case you have to instruct nasm to output .obj instead). Then you can call
assembly language functions from C and vice versa. (Figuring out how to
retrieve your function's arguments in assembly language is very interesting
and will enlighten you about the implementation of high-level languages.) Most
"real-world" assembly code does this: Most of the program is written in C, and
only the functions that need the particular advantages of assembly language
are written in it.

------
drudru11
When I saw this, I got a little freaked out. I pulled up this same page for a
different reason today :-) I thought nkurz was snooping my logs :-)

At any rate, hopefully one day soon we will move away from the x86 and its
legacy. While some people believe that architecture at this level doesn't
matter... and they are mostly right... the x86 is still ugly.

------
fijal
I'm sure it's important for the demo scene (because the binary size is the
only factor), but this is largely irrelevant for modern architectures when
performance is considered.

