Hacker News
Let's Write Some x86-64 (nickdesaulniers.github.io)
163 points by mbrubeck on May 1, 2014 | 44 comments

I think it's important to state that ABIs are just conventions, standards decided upon mostly by high-level-language designers, and they may not reflect any inherent limitations of the machine itself. In other words, you'll need to know them if you interface with other code written in a HLL, but if you're using Asm, there's no need to stick to that rigidity within the code you write.

For example, the stack alignment restriction only came about because some SSE instructions initially required that their operands be aligned in memory. This is a rather odd restriction considering that x86 was always intelligent about unaligned accesses (they used to be slower, but the penalty has almost disappeared with the latest models), and later revisions of SSE added instructions for unaligned access. But this doesn't affect any other instructions - in particular, call/return don't care, so in your own code you don't need to align the stack every time - it's only when you call into some other code that requires/expects such a condition.

Ditto for calling conventions; they are defined just so you can interface with other code, and not due to any inherent aspect of the instruction set. In fact although there are explicit call/return instructions, nothing requires that you use them, nor does the concept of a "function" even exist at this level. This means you can write horribly convoluted "spaghetti code", but also do what can't be done easily in a higher-level language with "unconventional" control flow, like coroutines (https://news.ycombinator.com/item?id=7676691 ).

Also, I believe that you should learn Asm not just so you can write the same code a compiler could generate, but so you can see and exploit the full potential of the machine. I feel this point is lost in a lot of material on this subject, maybe because their authors also never realised this, and so contribute to the belief that using Asm is extremely difficult and tedious with very little gain. IMHO it's only when you "break out" of the restricted way of thinking that HLLs impose, and see the machine for what it really is, that you can truly see the cost-benefit tradeoff and what it means for code to be efficient. Look at the demoscene 4k/64k productions for some inspiration. :-)

An opportunity to mention that Go uses the Plan 9 ABI, which is different from the GCC-style C ABI. Plan 9 did away with previous ABI conventions to make it easier to port code. Portability was built into Plan 9 from the outset, rather than "ok, it works for x86, how shall we #ifdef other CPUs into our code"

A simple overview is given in this recent article


Go doesn't use the Plan 9 calling convention, and there's not a single Plan 9 calling convention anyway. Go arguments are passed on the stack, and results are on the stack too. The Plan 9 C compiler used by the Go toolchain passes arguments on the stack, but the return value is in a register. On Plan 9, the first argument is usually (but not always) in a register. All these are different.

Go inherited and extended the Plan 9 toolchain, but the calling convention was changed. The main reason Go returns on the stack is to support multiple return values.

This is very insightful. What do you do day to day for a living, if I may ask?

I've done a bit of everything, from web development to embedded systems and electronics; currently mostly the latter. Some reverse-engineering included.

That sounds familiar, though I'm more of the former. Not sure I'd enjoy the lower level stuff day to day, but I do enjoy knowing the full stack (all the way down to the transistors, digital system design, and physics) because it allows you to converse with a wide range of people with a diverse range of backgrounds.

Yeah, when I had finally peeled back the layers all the way to asm I had this moment of enlightenment. Personally I think the time it took me to peel back those layers was wasted, and I wish I had started with asm rather than a HLL. I think it would be much better for students to move on to HLLs once they see the need for automating certain things. Knuth was right to use a synthetic assembly language rather than a HLL for The Art of Computer Programming.

Knuth's "Structured Programming with Go To Statements" is also a good read and explicitly points out how high-level languages make certain ways of expressing algorithms hard or impossible.

Great article, and nice to have clear examples of calling printf() from x64!

x86-64 ABI point 3: Variadic functions need to have the number of vector arguments specified in %al.

Did you find any articles that explain the rationale behind the way "varargs" argument passing is done? Or better specifications for it? It feels like a total mishmash.

For example, why is the number of vector arguments specified explicitly, but the number of total arguments is not? It seems so obviously useful to pass that total that you'd hope there would be a good reason not to do so.

And since printf() is using the format string to determine the number of variables anyway, why can't it just count the number of vector arguments itself? What does it do differently when the wrong number is given?

See: https://blog.nelhage.com/2010/10/amd64-and-va_arg ("To start, any function that is known to use va_start is required to, at the start of the function, save all registers that may have been used to pass arguments onto the stack, into the 'register save area', for future access by va_start and va_arg. This is an obvious step, and I believe pretty standard on any platform with a register calling convention. The registers are saved as integer registers followed by floating point registers. As an optimization, during a function call, %rax is required to hold the number of SSE registers used to hold arguments, to allow a varargs caller to avoid touching the FPU at all if there are no floating point arguments.")

When va_start is used, it needs to save argument registers to the stack in the prologue of the function. The program is free to use different conventions (format string, sentinel value) to signal to the callee how many arguments there are. But the code generated for va_start has no way of knowing what convention the program happens to use.

Great reference, thanks!

I guess it makes sense to keep a consistent argument passing ABI, but I still find the answer quite sad: to preserve the ability to call functions without prototypes, you pass the arguments in registers and then immediately write them back to the stack.

Putting the number of vectors in %rax/%al seems at odds with the consistent ABI argument. Once you are changing things to require this, it seems like you might as well make some other useful changes as well: like passing the number of arguments and skipping the register to stack conversion.

It would be nice if there was a contortion-free entry point to the x64 printf() that started with the values on the stack. Can vprintf(const char *format, va_list ap) be used in this way instead? Is a 'va_list' just a block of memory containing the arguments?

I guess I need to study the article you referred to (and stare at the libc source a while: https://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-commo...)

> to preserve the ability to call functions without prototypes, you pass the arguments in registers and then immediately write them back to the stack.

"Back" to the stack is a big assumption. A lot of the time, there is no reason an argument would be on the stack in the first place; or even if it is a spilled variable, it usually wouldn't be in the right place (without some fairly uselessly clever stack layout), so it's just a matter of whether the caller or callee does the write.

The GP is not alone in thinking that the x86-64 ABI is "a total mishmash"; it feels to me like the designers were deliberately avoiding optimisation, when an ABI is really one thing where any optimisation wouldn't ever be "premature" - after all, it's something that's going to be used by millions if not more pieces of software, so little things really add up.

For example, it's not necessary that all the args passed in registers need be written back to the stack by va_start --- ignoring for the moment the complication of structures and SSE registers, a va_list could just contain one field, which would initially hold an index, the "current register number" where the desired argument is stored. Then va_arg (which is now implemented as a compiler primitive) could generate code that checks this index, and if the index is less than the maximum number of arguments that can be passed in registers, resolve to an access to the particular register (here's where a "register-indirect" addressing mode would be really useful, since it could just use that index). Otherwise that field becomes a pointer into the stack like the traditional implementation. In other words, depending on that index, reading a scalar va_arg turns into reading a register or reading from the stack.

It's not that hard to extend this to accommodate structures; and I don't see any limiting reason on why the ABI couldn't pass structures partially in registers and partially on the stack. We just have to define the rules that let us do so, as a structure is only a concept of grouping data together, an HLL construct. There's nothing that restricts a structure to being always a contiguous set of registers or memory locations. Having to "reassemble" a structure in memory is only necessary if its address is taken, and then only if operations other than accessing its members via that address are performed; otherwise its components can be operated on independently, regardless of whether they're in registers or memory. Since the compiler knows where va_arg is used with a structure type, it can decide whether it really needs to generate the code to reassemble the structure, or only a series of scalar accesses. (I touch on this point somewhat here too: https://news.ycombinator.com/item?id=7683823 )

Extending this to the SSE regs is simpler than for structures, since this is merely another bank of registers that arguments can be read from: another field in va_list to hold the SSE register index/pointer.

I find it a bit odd that the ABI would define an actual implementation of varargs rather than only where the arguments will be expected to be, since how they're actually accessed should be outside the scope of an ABI spec; now every compiler vendor is tempted to just copy this (IMHO sub-optimal) implementation instead of thinking about how it could be done, and possibly coming up with something better.

I'm not above admitting to using printf debugging. ;) rayiner found some great answers to your question.

I think it's often the right tool. My personal flavor involves wrapping the printf() with a conditional that depends on the current function name and the value of an environment variable. I've been trying to figure out how best to write a generic macro for this in x64.

You might be interested in Josh Haberman's post on using GDB's "breakpoint command lists" as an alternative to actually putting the variadic calls into the assembly: http://blog.reverberate.org/2013/06/printf-debugging-in-asse...

I prefer NASM-flavored Intel syntax over AT&T; this is mostly a matter of taste, and AT&T is certainly more regular, but the NASM syntax for addressing looks more natural to me:

  lea edi, [eax + myarray + ebx * 4]
Versus AT&T:

  lea myarray(%eax,%ebx,4), %edi
I had to look up the order of the syntax to write this comment; you can certainly memorize it, but why bother when it can be written more naturally? I wrote the NASM example in the same order for ease of comparison, but it can be shuffled around in any order as long as the expression simplifies to a valid addressing mode.

When using AT&T syntax, I also find it cleaner to drop the size suffix (movq vs mov) from the instruction whenever the size can be inferred from a register. It is still necessary in some cases that are ambiguous (e.g. mov $imm, (mem), where no register fixes the operand size), so I suppose for teaching purposes, it might not hurt to always include it.

By the way, the 32-bit x86 ABI is much simpler than x86-64, so if you're learning assembly language for the first time, x86-64 might not be the easiest place to start. x86-64 also has some unintuitive behavior like zero-extending 32-bit mov (mov ax, 0x1234 does not modify the high part of eax, but mov eax, 0x12345678 clears the top half of rax).

There is no such thing as "the x86 ABI". You're probably referring to the GCC ABI or something. For x86-64 there is the AMD64/System V ABI but, again, it's not "the x86-64 ABI". Linux uses these, I don't know about other OSes. For ARM there are a few different ones in use, namely soft and hard float ABIs.

Personally I would say x86-64 is easier than x86 because of extra registers. The System V ABI is a bit more complicated because arguments are passed in registers, which is great if they all fit, but makes it more complicated if they don't (it does encourage you to try to make them fit, though). The ARM ABIs are very similar in this respect, though.

Yeah, someone brought up almost exactly the same example on the programming subreddit. What I don't like about the x86 ABI is that it feels deprecated to me; knowing it, like knowing 68k or MIPS, feels useless relative to knowing the ABIs of more mainstream targets like x86-64 or ARM.

x86 32-bit is far from "deprecated", you can do a lot with 32 bits already (especially if you're writing in Asm).

MIPS may not be as visible as ARM but it's still used in a ton of embedded devices - like routers.

This was _really_ helpful. I'm just getting started with assembly, and trying to hack together simple programs has so far always resulted in a segfault (8-byte offset, I know now!). Well and thoroughly written, too; I love articles like this.

Well, I wrote this article just for programmers like you! I'm glad you found it useful. When I learned these four points, I knew I had to share them. If you have further questions, I'd be more than happy to follow up.

I know very little Assembly Language, but what I recall is that you can skip the libc dependency on Linux if you just make the system call yourself.

So the tiny snippet at the top will become, for instance:

    .globl _main
    _main:
      subq $8, %rsp
      movq $1, %rax     # 1 = exit (32-bit syscall table)
      movq $2, %rbx     # exit status
      int $0x80
The last three lines choose the 'exit' system call¹, load the number 2 to be its argument, and make the call. Then you can get an executable like so:

    as nothing.s -o nothing.o

    ld nothing.o -o nothing -e _main    
Then you can run it and check the return value.

¹ http://docs.cs.up.ac.za/programming/asm/derick_tut/syscalls....

I would advise against interrupt-based syscalls for x86 platforms on Linux... a lot of optimization work has been done for x86-32 via the vDSO (and, afaik, x86-64 interrupt-based syscalls are just there for compatibility reasons, but I've been wrong before)

For Linux x86-64 syscalls, the proper way is to use the "syscall" instruction. For Linux x86-32 stuff, the best way is to make a call via the vDSO with "call *%gs:0x10" (hopefully I didn't butcher the AT&T syntax) so the kernel can dictate (via the vDSO) the best way to actually get you to the point of performing the syscall.

and, to your point, I recommend that anyone disassemble their code compiled with libc (and look at the differences between linux, os x, bsd, whatever)... it gets really interesting how much is added and how each OS handles it.

Awesome! Thank you for your advice.

Either way, I am proceeding with the rest by using what someone else recommended I do, which is use `gcc -nostartfiles`. Looking at gcc in verbose mode reveals an ld command line I could use to run his code without modification:

    ld nothing.o -o nothing -e _main -dynamic-linker /lib64/ld-linux-x86-64.so.2 -lc
For some reason, `/lib/ld64.so.2` is not linked to `/lib64/ld-linux-x86-64.so.2` even though `ld` on Linux seems to look there by default (and fails because none of the mainstream distributions make that link).

Hahaha, the intro is (probably unintentionally) ironic. They talk about the "compiler" getting in the way of what is truly happening down below. But x86-64 is just as much of an intermediate language, and because of the madness of microcode, reordering, caches, hyperthreading, and a hundred other things, it is not all that good a description of what is really happening inside the processor.

Excellent point. Michael Abrash's Graphics Programming Black Book is filled with absolutes about certain instructions or groups of instructions being faster than others, but nowadays there are so many complex interactions like the ones you describe. Unfortunately, I feel that those complexities are used as justification to dissuade the learning of assembly language.

It should actually be used as a justification to persuade: you can usually write x86 assembly in any way that comes to mind and as long as data access (load/store) is the same across these variants the CPU will take the same number of cycles to execute them due to the crazy amount of optimization in the modern architectures.

Unless I'm misunderstanding what you are saying, this is not true. Instruction selection and data dependencies play a big role in the performance of a routine.

What is indeed true is that the x86(-64) ISA alone does not give you enough information to predict performance accurately. Furthermore, due to the interactions between the different subsystems (e.g, caches, OoO buffers, etc) it is essentially impossible to determine performance with cycle-accuracy. But it is still possible to have a first-order approximation, for a given microarchitecture, of what is going on underneath the CISC mask.

For more in-depth information on different ABIs (Clang isn't covered but follows GCC pretty closely), see Manual #5 [1] of Agner Fog's optimization manuals.

[1] http://agner.org/optimize/#manual_call_conv

Nice recommendation. Someone mentioned it as well on proggit: http://www.reddit.com/r/programming/comments/24gpqp/lets_wri...

Any ABI differences between Clang and GCC for externally visible functions is a bug. The system defines that ABI (SysV for Mac and Linux x86_64). Internally visible functions can use different ABIs that allow the compiler to perform more aggressive optimizations.

I have always been very interested in learning x86 (or _64) assembly. Is there a proper guide I can follow say over the summer?

Try "Practical x64 Assembly and C++":

- http://www.whatsacreel.net76.net/asmtutes.html

- https://www.youtube.com/playlist?list=PL0C5C980A28FEE68D

Covers x86-64, MMX, SSE2/3/4, AVX.

The author, Chris Rose, has also written a free e-book: "Assembly Language Succinctly" -- (PDF) https://www.syncfusion.com/Content/downloads/ebook/Assembly_...


The official manuals from Intel are pretty good; better as a reference, but still good:


A lot of the articles I link to at the end should be helpful. I'm torn; the compiler manuals have so much useful info in them, but they suck when it comes to learning assembly for the first time. Maybe someone else has a good book that they would recommend? Maybe this [0] but with some of the updates in mind.

[0] http://www.amazon.com/Professional-Assembly-Language-Richard...

I recently worked my way through "Programming From the Ground Up" and found it to be a good introduction, but note that it's only x86 (32-bit). You can easily compile and link your code in 32-bit mode, but you'll be working with 32-bit registers and pointers, and some slightly outdated techniques for making syscalls and such. At first I tried translating the book's examples to x64, but found it was pretty painful to mix that task with the complexity of learning assembly.


I was lucky enough to find a copy of Guide to Assembly Language: A Concise Introduction by James T. Streib which has been immensely helpful. http://link.springer.com/book/10.1007/978-0-85729-271-1

I also stumbled on The Art of Assembly which may prove promising. http://www.ic.unicamp.br/~pannain/mc404/aulas/pdfs/Art%20Of%...

x86-64 is quite nice, even reminds me somewhat of my 68000 days. I played with x86-64 relative addressing a while ago. (see the 'edit' a bit lower for a working example):


Please don't put for loops in Makefiles.


Learn assembly from someone who doesn't understand assembly! Get the real Comp Sci I experience!

I think some of the best hackers are self taught. The US university system is a racket, and I have 10 books I could recommend that would teach you more.

If you have recommendations for books on amd64 assembly -- ideally ones that focus on running under a free Unix stack (Linux, FreeBSD, etc.) rather than a Microsoft stack where the books address such things -- I'd love to see them. I did a very little bit of playing with real mode x86 assembly back when MS-DOS was my day-to-day OS, but that's a mite dated now.

I would be pretty keen to see that list of books
