
Let's Write Some x86-64 - mbrubeck
http://nickdesaulniers.github.io/blog/2014/04/18/lets-write-some-x86-64/
======
userbinator
I think it's important to state that ABIs are just conventions, standards
that have been decided upon mostly by _high-level-language_ designers, and may
not reflect any inherent limitations of the machine itself. In other words, you'll
need to know them if you interface with other code written in a HLL, but if
you're using Asm, there's no need to stick to this rigidity _within_ the code
you write.

For example, the stack alignment restriction only came about because _some_
SSE instructions initially required that their operands be aligned in memory.
This is a rather odd restriction considering that x86 was always intelligent
about unaligned accesses (they used to be slower, but the penalty has almost
disappeared with the latest models), and later revisions of SSE added
instructions for unaligned access. But this doesn't affect any other
instructions - in particular, call/return don't care, so in your own code you
don't need to align the stack every time - it's only when you call into some
other code that requires/expects such a condition.
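As an illustrative sketch of that point (GNU as, AT&T syntax; deliberately not ABI-conforming code):

```asm
inner:                          # private helper, never exposed to HLL code
        addq    $1, %rax
        ret                     # call/ret work at any stack alignment

outer:
        pushq   %rbx            # no effort made to keep %rsp 16-byte aligned
        call    inner           # fine, as long as nothing here calls into
        popq    %rbx            # ABI-conforming code or uses aligned SSE
        ret                     # loads/stores on the stack
```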

Ditto for calling conventions; they are defined just so you can interface with
other code, and not due to any inherent aspect of the instruction set. In fact
although there are explicit call/return instructions, nothing requires that
you use them, nor does the concept of a "function" even exist at this level.
This means you can write horribly convoluted "spaghetti code", but also do
what can't be done easily in a higher-level language with "unconventional"
control flow, like coroutines
([https://news.ycombinator.com/item?id=7676691](https://news.ycombinator.com/item?id=7676691)).

Also, I believe that you should learn Asm not just so you can write the same
code a compiler could generate, but so you can see and exploit the full
potential of the machine. I feel this point is lost in a lot of material on
this subject, maybe because their authors also never realised this, and so
contribute to the belief that using Asm is extremely difficult and tedious
with very little gain. IMHO it's only when you "break out" of the restricted
way of thinking that HLLs impose, and see the machine for what it really is,
that you can truly see the cost-benefit tradeoff and what it means for code to
be efficient. Look at the demoscene 4k/64k productions for some inspiration.
:-)

~~~
SixSigma
An opportunity to mention that Go uses the Plan 9 ABI, which is different
from the GCC-style C ABI. Plan 9 did away with previous ABI conventions to
make it easier to port code. Portability was built into Plan 9 from the
outset, rather than "ok, it works for x86, how shall we #ifdef other CPUs
into our code?"

A simple overview is given in this recent article

[http://nelhagedebugsshit.tumblr.com/post/84342207533/things-i-learned-writing-a-jit-in-go](http://nelhagedebugsshit.tumblr.com/post/84342207533/things-i-learned-writing-a-jit-in-go)

~~~
4ad
Go doesn't use the Plan 9 calling convention, and there's not a single Plan 9
calling convention anyway. Go arguments are passed on the stack, and results
are on the stack too. The Plan 9 C compiler used by the Go toolchain passes
arguments on the stack, but the return value is in a register. On Plan 9, the
first argument is usually (but not always) in a register. All these are
different.

Go inherited and extended the Plan 9 toolchain, but the calling convention was
changed. The main reason Go returns on the stack is to support multiple return
values.

------
nkurz
Great article, and nice to have clear examples of calling printf() from x64!

 _x86-64 ABI point 3: Variadic functions need to have the number of vector
arguments specified in %al._

Did you find any articles that explain the rationale behind the way "varargs"
argument passing is done? Or better specifications for it? It feels like a
total mishmash.

For example, _why_ is the number of vector arguments specified explicitly, but
the number of total arguments is not? It seems so obviously useful to pass
that total that you'd hope there would be a good reason not to do so.

And since printf() is using the format string to determine the number of
variables anyway, why can't it just count the number of vector arguments
itself? What does it do differently when the wrong number is given?

~~~
rayiner
See: [https://blog.nelhage.com/2010/10/amd64-and-va_arg](https://blog.nelhage.com/2010/10/amd64-and-va_arg) ("To start, any
function that is known to use va_start is required to, at the start of the
function, save all registers that may have been used to pass arguments onto
the stack, into the 'register save area', for future access by va_start and
va_arg. This is an obvious step, and I believe pretty standard on any platform
with a register calling convention. The registers are saved as integer
registers followed by floating point registers. As an optimization, during a
function call, %rax is required to hold the number of SSE registers used to
hold arguments, to allow a varargs caller to avoid touching the FPU at all if
there are no floating point arguments.")

When va_start is used, it needs to save argument registers to the stack in the
prologue of the function. The program is free to use different conventions
(format string, sentinel value) to signal to the callee how many arguments
there are. But the code generated for va_start has no way of knowing what
convention the program happens to use.
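A minimal variadic function makes that mechanism concrete. A hedged C sketch (assuming the SysV x86-64 ABI; the register saving happens in compiler-generated prologue code, not in anything you write, and `sum_ints` is a made-up name):

```c
#include <assert.h>
#include <stdarg.h>

/* Sum `count` ints passed variadically.  On x86-64 SysV, the caller puts
 * the number of vector registers used in %al; at a call site with no
 * floating-point arguments %al is 0, so the prologue code generated for
 * va_start can skip saving %xmm0-%xmm7 entirely. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);

    return total;
}
```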

~~~
nkurz
Great reference, thanks!

I guess it makes sense to keep a consistent argument passing ABI, but I still
find the answer quite sad: to preserve the ability to call functions without
prototypes, you pass the arguments in registers and then immediately write
them back to the stack.

Putting the number of vectors in %rax/%al seems at odds with the consistent
ABI argument. Once you are changing things to require this, it seems like you
might as well make some other useful changes as well: like passing the number
of arguments and skipping the register to stack conversion.

It would be nice if there were a contortion-free entry point to the x64
printf() that started with the values on the stack. Can vprintf(const char
*format, va_list ap) be used in this way instead? Is a 'va_list' just a block
of memory containing the arguments?

I guess I need to study the article you referred to (and stare at the libc
source a while:
[https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c;h=c4ff8334b206fdeb3ae1e97ebec203552ca6a1ff;hb=HEAD](https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c;h=c4ff8334b206fdeb3ae1e97ebec203552ca6a1ff;hb=HEAD))

~~~
comex
> to preserve the ability to call functions without prototypes, you pass the
> arguments in registers and then immediately write them back to the stack.

"Back" to the stack is a big assumption. A lot of the time, there is no reason
an argument would be on the stack in the first place; or even if it is a
spilled variable, it usually wouldn't be in the right place (without some
fairly uselessly clever stack layout), so it's just a matter of whether the
caller or callee does the write.

------
drv
I prefer NASM-flavored Intel syntax over AT&T; this is mostly a matter of
taste, and AT&T is certainly more regular, but the NASM syntax for addressing
looks more natural to me:

    
    
      lea edi, [eax + myarray + ebx * 4]
    

Versus AT&T:

    
    
      lea myarray(%eax,%ebx,4), %edi
    

I had to look up the order of the syntax to write this comment; you can
certainly memorize it, but why bother when it can be written more naturally? I
wrote the NASM example in the same order for ease of comparison, but it can be
shuffled around in any order as long as the expression simplifies to a valid
addressing mode.

When using AT&T syntax, I also find it cleaner to drop the size suffix (movq
vs mov) from the instruction whenever the size can be inferred from a
register. It is still necessary in some cases that are ambiguous (e.g. mov
$imm, (mem), where neither operand implies an operand size), so I suppose for
teaching purposes, it might not hurt to always
include it.

By the way, the 32-bit x86 ABI is much simpler than x86-64, so if you're
learning assembly language for the first time, x86-64 might not be the easiest
place to start. x86-64 also has some unintuitive behavior like zero-extending
32-bit mov (mov ax, 0x1234 does not modify the high part of eax, but mov eax,
0x12345678 clears the top half of rax).
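That zero-extension rule can be mimicked in portable C (an analogy only; `mov_eax` and `mov_ax` are made-up names, not real instructions or intrinsics):

```c
#include <assert.h>
#include <stdint.h>

/* What "mov eax, imm32" does to the containing 64-bit register: the old
 * contents are discarded and the 32-bit value is zero-extended. */
static uint64_t mov_eax(uint64_t rax, uint32_t imm)
{
    (void)rax;            /* upper half is cleared, not preserved */
    return (uint64_t)imm;
}

/* What "mov ax, imm16" does: only the low 16 bits change. */
static uint64_t mov_ax(uint64_t rax, uint16_t imm)
{
    return (rax & ~(uint64_t)0xffff) | imm;
}
```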

~~~
ndesaulniers
Yeah, someone brought up almost the same example on the programming
subreddit. What I don't like about the x86 ABI is that it feels deprecated to
me. Like knowing 68k or MIPS feels useless to me, relative to knowing the ABIs
of more mainstream devices like x86-64 or ARM.

~~~
userbinator
x86 32-bit is far from "deprecated", you can do a lot with 32 bits already
(especially if you're writing in Asm).

MIPS may not be as visible as ARM but it's still used in a ton of embedded
devices - like routers.

------
litonico
This was _really_ helpful. I'm just getting started with assembly, and trying
to hack together simple programs has so far always resulted in a segfault
(8-byte offset, I know now!). Well and thoroughly written, too; I love
articles like this.

~~~
ndesaulniers
Well, I wrote this article just for programmers like you! I'm glad you found
it useful. When I learned the four points I listed, I knew I had to share
them. If you have further questions, I'd be more than happy to follow up.

------
arjie
I know very little Assembly Language, but what I recall is that you can skip
the libc dependency on Linux if you just run the system call yourself.

So the tiny snippet at the top will become, for instance:

    
    
        .text
        .globl _main
        _main:
      subq $8, %rsp
      movq $1, %rax        # 1 = exit in the 32-bit int $0x80 syscall table
      movq $2, %rbx        # exit status
      int $0x80
    

The last three lines choose the 'exit' system call¹, load the number 2 to be
its argument, and make the call. Then you can get an executable like so:

    
    
        as nothing.s -o nothing.o
    
        ld nothing.o -o nothing -e _main    
    

Then you can run it and check the return value.

¹
[http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls....](http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls.html)

~~~
yusyusyus
I would advise against interrupt-based syscalls for x86 platforms on Linux...
a lot of optimization work has been done for x86-32 via the vDSO (and, afaik,
x86-64 interrupt-based syscalls are just there for compatibility reasons, but
I've been wrong before)

For Linux x86-64 syscalls, the proper way is to use the "syscall"
instruction. For Linux x86-32 stuff, the best way is to make a call via the
vDSO with "call *%gs:0x10" so the kernel can dictate (via the vDSO) the best
way to actually get you to the point of performing the syscall.
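For reference, a hedged sketch of the 64-bit path described above (GNU as, AT&T syntax; exit is syscall number 60 on x86-64 Linux, with the status in %rdi):

```asm
        .text
        .globl _start
_start:
        movq    $60, %rax       # __NR_exit in the x86-64 syscall table
        movq    $2, %rdi        # exit status
        syscall                 # no int $0x80 needed on 64-bit
```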

and, to your point, I recommend that anyone disassemble their code compiled
with libc (and look at the differences between linux, os x, _bsd, whatever)...
it gets really interesting how much is added and how each OS handles it.

~~~
arjie
Awesome! Thank you for your advice.

Either way, I am proceeding with the rest by using what someone else
recommended I do, which is use `gcc -nostartfiles`. Looking at gcc in verbose
mode reveals an ld command line I could use to run his code without
modification:

    
    
        ld nothing.o -o nothing -e _main -dynamic-linker /lib64/ld-linux-x86-64.so.2 -lc
    

For some reason, `/lib/ld64.so.2` is not linked to `/lib64/ld-
linux-x86-64.so.2` even though `ld` on Linux seems to look there by default
(and fails because none of the mainstream distributions make that link).

------
kazagistar
Hahaha, the intro is (probably unintentionally) ironic. They talk about the
"compiler" getting in the way of what is truly happening down below. But
x86-64 is just as much of an intermediate language, and because of the madness
of microcode, reordering, caches, hyperthreading, and a hundred other things,
it is not all that good of a description of what is really happening inside
the processor.

~~~
ndesaulniers
Excellent point. Michael Abrash's Graphics Programming Black Book is filled
with absolutes about certain instructions or groups of instructions being
faster than others, but nowadays there are so many complex interactions like
the ones you describe. Unfortunately, I feel that those complexities are used
as justification to dissuade the learning of assembly language.

~~~
mischanix
It should actually be used as a justification to persuade: you can usually
write x86 assembly in any way that comes to mind, and as long as the data
accesses (loads/stores) are the same across these variants, the CPU will take
the same number of cycles to execute them, due to the crazy amount of
optimization in modern architectures.

~~~
pbsd
Unless I'm misunderstanding what you are saying, this is not true. Instruction
selection and data dependencies play a big role in the performance of a
routine.

What is indeed true is that the x86(-64) ISA alone does not give you enough
information to predict performance accurately. Furthermore, due to the
interactions between the different subsystems (e.g., caches, OoO buffers, etc.)
it is essentially impossible to determine performance with cycle-accuracy. But
it is still possible to have a first-order approximation, for a given
microarchitecture, of what is going on underneath the CISC mask.

------
mischanix
For more in-depth information on different ABIs (Clang isn't covered but
follows GCC pretty closely), see Manual #5 [1] of Agner Fog's optimization
manuals.

[1]
[http://agner.org/optimize/#manual_call_conv](http://agner.org/optimize/#manual_call_conv)

~~~
ndesaulniers
Nice recommendation. Someone mentioned it as well on proggit:
[http://www.reddit.com/r/programming/comments/24gpqp/lets_wri...](http://www.reddit.com/r/programming/comments/24gpqp/lets_write_some_x8664/)

------
middleclick
I have always been very interested in learning x86 (or _64) assembly. Is there
a proper guide I can follow say over the summer?

~~~
matt_d
Try "Practical x64 Assembly and C++":

- [http://www.whatsacreel.net76.net/asmtutes.html](http://www.whatsacreel.net76.net/asmtutes.html)

- [https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D](https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D)

Covers x86-64, MMX, SSE2/3/4, AVX.

The author, Chris Rose, has also written a free e-book, "Assembly Language
Succinctly" (PDF):
[https://www.syncfusion.com/Content/downloads/ebook/Assembly_Language_Succinctly.pdf](https://www.syncfusion.com/Content/downloads/ebook/Assembly_Language_Succinctly.pdf)

HTH!

------
eterps
x86-64 is quite nice, even reminds me somewhat of my 68000 days. I played with
x86-64 relative addressing a while ago. (see the 'edit' a bit lower for a
working example):

[http://stackoverflow.com/questions/3250277/how-to-use-rip-relative-addressing-in-a-64-bit-assembly-program](http://stackoverflow.com/questions/3250277/how-to-use-rip-relative-addressing-in-a-64-bit-assembly-program)

------
markrages
Please don't put for loops in Makefiles.

~~~
ndesaulniers
because...

------
skewp
Learn assembly from someone who doesn't understand assembly! Get the real Comp
Sci I experience!

~~~
ndesaulniers
I think some of the best hackers are self taught. The US university system is
a racket, and I have 10 books I could recommend that would teach you more.

~~~
gdwatson
If you have recommendations for books on amd64 assembly -- ideally ones that
focus on running under a free Unix stack (Linux, FreeBSD, etc.) rather than a
Microsoft stack where the books address such things -- I'd love to see them. I
did a very little bit of playing with real mode x86 assembly back when MS-DOS
was my day-to-day OS, but that's a mite dated now.

